This approach usually uses lexicon translations obtained from a bilingual dictionary to measure the similarity of the content of the two texts.
from the Web
by
Le Quang Hung
Faculty of Information Technology
University of Engineering and Technology Vietnam National University, Hanoi
Supervised by
Dr Le Anh Cuong
A thesis submitted in fulfillment of the requirements for
the degree of Master of Information Technology
December, 2010
Contents
1.1 Parallel corpus and its role 1
3.1.2 Content-based filtering module 18
3.1.2.1 The method based on cognation 20
3.1.2.2 The method based on identifying translation segments
3.1.3 Structure analysis 28
List of Figures
1.1 An example of English-Vietnamese parallel texts 2
2.5 The algorithm of translation pairs finder [3]
2.6 Architecture of the PTI system [4]
2.7 An example of the two links in the text
3.1 Architecture of the Parallel Text Mining system 17
3.4 Description of the process of the content-based filtering module 20
3.5 An example of two corresponding texts of English and Vietnamese 22
3.6 The algorithm measures similarity of cognates between a text pair 25
3.9 Identifying translation paragraphs 37
3.10 A sample code written in Java to perform translation from English into Vietnamese via the Google AJAX API
3.11 Web documents and the source HTML code for two parallel translations
3.12 An example of the publication date feature extracted from an HTML page 30
3.13 Classification model 31
4.1 Figure for precision and recall measures 32
4.2 The format of training and testing data 34
4.3 Performance of the identifying translation segments method 38
4.4 Comparison of the methods 39
List of Tables
1.1 Europarl parallel corpus: 10 aligned language pairs, all of which
Overall results of each method (P-Pr
URLs from three sites: BBC, VOA News and VietnamPlus 33
No. of pages downloaded and No. of candidate pairs
Structure-based method
Content-based method
Method based on cognation
Combining structural features and cognate information
Identifying translation at document level
Chapter 1
Introduction
In this chapter, we first introduce the parallel corpus and its role in NLP applications. The objectives and contributions are then presented. Finally, the thesis' structure is briefly described.
Parallel text
Different definitions of the term “parallel text” (also known as bitext) can be found in the literature. As a common understanding, a parallel text is a text in one language together with its translation in another language. Dan Tufis [5] gives a definition: “parallel text is an association between two texts in different languages that represent translations of each other”. Figure 1.1 shows an example of English-Vietnamese parallel texts.
Parallel corpus
A parallel corpus is a collection of parallel texts. According to [6], the simplest case is where only two languages are involved: one of the corpora is an exact translation of the other (e.g., the COMPARA corpus [7]). However, parallel corpora exist in several languages. For instance, the Europarl parallel corpus [8] includes versions as reported in Table 1.1. In addition, the same parallel
[Figure 1.1 appears here: a two-column table of five aligned English-Vietnamese paragraph pairs about Olaudah Equiano and early American slave narratives.]
FIGURE 1.1: An example of English-Vietnamese parallel texts.
corpus may have been translated from language L1 to language L2, and other parts the other way around. The direction of the translation may not even be known.
Parallel corpora exist in several formats. They can be raw parallel texts, or they can be aligned texts. The texts can be aligned at the paragraph level, the sentence level, or even at the phrase and word levels. The alignment of the texts is useful for extracting semantically equivalent components of the parallel texts; such components as words, phrases, and sentences are useful for bilingual dictionary construction [14, 15]. Parallel texts are also used for acquisition of lexical translations [16] or word sense disambiguation [17]. For most of the mentioned tasks, parallel corpora currently play a crucial role in NLP applications.
[Table 1.1 appears here: Europarl parallel corpus statistics, giving word counts per aligned language pair (e.g., Portuguese-English, Swedish-English), on the order of tens of millions of words each.]
processing. For that reason, many studies [1-4, 18-22] pay attention to mining parallel corpora from the Web. Basically, we can classify these studies into three groups: content-based (CB) [3, 4, 22], structure-based (SB) [1, 2, 18], and hybrid (a combination of both methods) [19-21].
The CB approach uses the textual content of the parallel document pairs being evaluated. This approach usually uses lexicon translations obtained from a bilingual dictionary to measure the similarity of the content of the two texts. When the bilingual dictionary is available, documents are translated word by word into the target language. The translated documents are then used to find the best matching parallel documents by applying similarity score functions such as cosine, Jaccard, Dice, etc. However, using a bilingual dictionary may face difficulty because a word usually has many translations.
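The scoring step described above can be sketched as follows. This is a minimal illustration, not any cited system's implementation: a toy bilingual dictionary maps each source word to a single target word (which is exactly the ambiguity problem the text notes), and the translated documents are compared with cosine, Jaccard, and Dice scores.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two token lists, via count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def jaccard(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def dice(a, b):
    sa, sb = set(a), set(b)
    return 2 * len(sa & sb) / (len(sa) + len(sb)) if sa or sb else 0.0

def translate(tokens, dictionary):
    """Word-by-word translation; unknown words are passed through.
    A real dictionary offers MANY translations per word, hence the
    ambiguity problem discussed in the text."""
    return [dictionary.get(t, t) for t in tokens]
```

The best-matching parallel document for a given source document would then be the candidate maximizing one of these scores over the translated tokens.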
Meanwhile, the SB approach relies on analysis of the HTML structure of the pages. This approach uses the hypothesis that parallel web pages are presented with similar structures. The similarity of the web pages is estimated based on their HTML structure. Note that this approach does not require linguistic knowledge. In addition, this approach is very effective in filtering out a large number of unmatched pairs. However, in some cases on the Web, the structure of two pages is similar but their content is different. For that reason, the HTML structure-based approach is not applicable in some cases.
As we have introduced, the parallel corpus is a valuable resource for different NLP tasks. Unfortunately, the available parallel corpora are not only relatively small in size, but also unbalanced, even for the major languages [3]. Some resources are available, such as for English-French, but the data are usually restricted to government documents (e.g., the Hansard corpus) or newswire texts. Others have limited availability due to licensing restrictions, as in [23]. According to [24], there are now some reliable parallel corpora: the Hansard Corpus¹, the JRC-Acquis Parallel Corpus², the Europarl parallel corpus³, and the COMPARA corpus⁴. There are only a few studies on mining parallel corpora from the Web, which is a big motivation for many studies on this task.
The objective of this research is to extract parallel texts from bilingual web sites, using methods based on cognation and on identifying translation segments. Then, we combine content-based features with structural features under a machine learning framework.
¹http://www.isi.edu/natural-language/download/hansard/
²http://langtech.jrc.it/JRC-Acquis.html
³http://www.statmt.org/europarl/
⁴http://www.linguateca.pt/COMPARA/
Encouraged by [20], we formulate this problem as a classification problem to utilize as much as possible the knowledge from structural information and the similarity of content. The most important contribution of our work is that we propose two new methods of designing content-based features, combined with structure-based features, to extract parallel texts from bilingual web sites:
• The first method is based on cognation. It is worth emphasizing that, differently from previous studies [2, 20], we use cognate information in place of word-by-word translation. From our observation, when translating a text from one language to another, some special parts are kept the same or changed only a little. These parts are usually abbreviations, proper nouns, and numbers. We also use other content-based features, such as the length of tokens and the length of paragraphs, which do not require any linguistic analysis. It is worth noting that this approach does not need any dictionary; thus we think it can be applied to other language pairs.
• The second method, based on identifying translation segments, is used to match translation paragraphs. That helps us to extract proper translation units in bilingual web pages. Previous studies usually use lexicon translations obtained from a bilingual dictionary to measure the similarity of the content of the two texts, such as in [4, 20]. This approach may face difficulty because a word usually has many translations. Differently, we use the Google translator, because by using it we can utilize the advantages of statistical machine translation: it helps in disambiguating lexical ambiguity, translating phrases, and reordering words.
Given below is a brief outline of the topics discussed in the next chapters of this thesis:
Chapter 2 - Related works
The studies that have close relations with our work are introduced in this chapter.
Chapter 3 - The proposed approach
We show our proposed model, including the general architecture of the model and how structural features and content-based features are designed and estimated.
Chapter 4 - Experiment
This chapter evaluates the effectiveness of our proposed method for extracting parallel texts from the Web. The performance of our proposed method and the baseline are presented here.
Chapter 5 - Conclusion and Future works
Final conclusions about our work as a whole and the evaluation of the results in particular are presented, followed by suggestions of possible future work that could be done.
Finally, the references introduce research that is closely related to our work.
Chapter 2
Related works
In this chapter, we outline the general framework for building a parallel corpus. Then, we review the studies that are closely related to our work.
FIGURE 2.1: General architecture in building parallel corpus
In general, there are two approaches to building a parallel corpus (illustrated in Figure 2.1). The first one is to automatically collect bilingual documents from the Web to extract parallel texts (the detail of this task is presented in the next sections). The other one is based on monolingual corpora [25]. As seen from the diagram, starting with two large monolingual corpora (a non-parallel corpus) divided into documents, this approach is composed of three steps: (1) selecting pairs of similar documents; (2) from each such pair, generating all possible sentence pairs and passing them through a simple word-overlap-based filter, thus obtaining candidate sentence pairs; and (3) presenting the candidates to a maximum entropy (ME) classifier that decides whether the sentences in each pair are mutual translations.
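Step (2), the word-overlap-based filter, can be sketched as below. This is an illustrative simplification under assumed conventions (a lexicon mapping each source word to a set of possible target words), not the exact filter of [25]:

```python
def word_overlap_ratio(src_tokens, tgt_tokens, lexicon):
    """Fraction of source tokens that have at least one translation
    appearing in the target sentence. `lexicon` maps a source word
    to a set of possible target words (illustrative format)."""
    if not src_tokens:
        return 0.0
    tgt = set(tgt_tokens)
    hits = sum(1 for w in src_tokens if lexicon.get(w, set()) & tgt)
    return hits / len(src_tokens)

def filter_candidates(pairs, lexicon, threshold=0.5):
    """Keep only sentence pairs whose overlap ratio meets the threshold;
    the survivors would go on to the ME classifier."""
    return [(s, t) for s, t in pairs
            if word_overlap_ratio(s, t, lexicon) >= threshold]
```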
The original STRAND is an architecture for structural translation recognition, acquiring natural data. Its goal is to identify pairs of web pages that are mutual translations. In order to do this, it exploits an observation about the way that web page authors disseminate information in multiple languages: when presenting the same content in two different languages, authors exhibit a very strong tendency to use the same document structure. STRAND therefore locates pages that might be translations of each other, via a number of different strategies, and filters out page pairs whose page structures diverge by too much. The STRAND architecture has three basic steps (illustrated in Figure 2.2):
[Figure 2.2 appears here: candidate pair generation, followed by a structural, language-independent evaluation that outputs translation pairs.]
FIGURE 2.2: The STRAND architecture [1]
• Location of pages that might have parallel translations,
• Generation of candidate pairs that might be translations, and
• Structural filtering out of non-translation candidate pairs.
The heart of STRAND is a structural filtering process that relies on analysis of the pages' underlying HTML to determine a set of pair-specific structural values, and then uses those values to decide whether the pages are translations of one another. The first step in this process is to linearize the HTML structure and ignore the actual linguistic content of the documents. Both documents in the candidate pair are run through a markup analyzer that acts as a transducer, producing a linear sequence containing three kinds of tokens:
[START:element_label] e.g., [START:H3]
[END:element_label] e.g., [END:H3]
[Chunk:length] e.g., [Chunk:174], for non-markup text
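A minimal sketch of such a markup transducer, using only the Python standard library, is shown below. It is an illustration of the token scheme just described, not the original STRAND implementation:

```python
from html.parser import HTMLParser

class Linearizer(HTMLParser):
    """Linearize a page into [START:label], [END:label], and
    [Chunk:length] tokens, discarding the linguistic content."""
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"[START:{tag.upper()}]")

    def handle_endtag(self, tag):
        self.tokens.append(f"[END:{tag.upper()}]")

    def handle_data(self, data):
        text = data.strip()
        if text:  # keep only the length of non-markup text
            self.tokens.append(f"[Chunk:{len(text)}]")

def linearize(html):
    p = Linearizer()
    p.feed(html)
    return p.tokens
```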
Trang 15The socond step is to align the linearized sequences using a standard dynamic programming technique For example, consider two documents that begin as Fig-
Ficure 2.3: An example of aligning two documents
Using this alignment, the authors compute four values from the aligned strue- tures which indicate the amount of non-shared material, the number of aligned non-markup text chunks of unequal length, the correlation of lengths of the aligned non-markup chunks, and the significance level of the correlation Machine learn- ing, namely decision trees, are then used for filtering, based on these four values
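The dynamic programming alignment can be sketched as a generic global alignment (Needleman-Wunsch style) over the two token sequences; gaps then correspond to non-shared material. The cost model here is illustrative, not the one used in [1]:

```python
def align(a, b, gap=-1, match=2, mismatch=-2):
    """Globally align two token sequences; returns (x, y) pairs where
    None marks a gap (non-shared material)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Backtrace to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            pairs.append((a[i - 1], b[j - 1])); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None)); i -= 1
        else:
            pairs.append((None, b[j - 1])); j -= 1
    return list(reversed(pairs))
```

From such an alignment one can count the gapped tokens and compare the lengths of aligned [Chunk:…] tokens, which is the information the four STRAND values summarize.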
The PTMiner system [2] works on extracting bilingual English-Chinese documents. This system uses a search engine to locate hosts containing the parallel web pages. In order to generate candidate pairs, the PTMiner uses a URL-matching process (e.g., the Chinese counterpart of a URL such as “http://www.XXXX.com/.../eng/....html” might be “http://www.XXXX.com/.../chi/....html”) and other heuristics.
FIGURE 2.4: The workflow of the PTMiner system [2]
The PTMiner implements the following steps (illustrated in Figure 2.4):
1. Search for candidate sites - Using existing Web search engines, search for the candidate sites that may contain parallel pages.
2. Filename fetching - For each candidate site, fetch the URLs of Web pages that are indexed by the search engines.
3. Host crawling - Starting from the URLs collected in the previous step, search through each candidate site separately for more URLs.
4. Pair scan - From the obtained URLs of each site, scan for possible parallel pairs.
5. Download and verifying - Download the parallel pages; determine the file size, language, and character set of each page; and filter out non-parallel pairs.
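The pair-scan step can be sketched as below. The language-marker substitutions are illustrative examples in the spirit of the URL-matching process above, not PTMiner's actual rule set:

```python
import re

# Illustrative (source_marker, target_marker) pairs; PTMiner's real
# rules are more extensive.
LANG_PAIRS = [("eng", "chi"), ("en", "ch"), ("english", "chinese")]

def candidate_partner(url):
    """Return possible counterpart URLs by swapping language markers
    that appear as whole path/filename segments."""
    out = []
    for a, b in LANG_PAIRS:
        pat = re.compile(rf"(?<=[/_.]){a}(?=[/_.])")
        if pat.search(url):
            out.append(pat.sub(b, url))
    return out

def pair_scan(urls):
    """From the obtained URLs of one site, return (url, partner)
    pairs where both URLs actually exist on the site."""
    urlset = set(urls)
    pairs = []
    for u in urls:
        for v in candidate_partner(u):
            if v in urlset:
                pairs.append((u, v))
    return pairs
```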
In the experiment, several hundred selected pairs were evaluated manually. The results were quite promising: from a corpus of 250 MB of English-Chinese text, statistical evaluation showed that of the pairs identified, 90% were correct.
2.3 Content-based methods
The approach discussed thus far relies heavily on document structure. However, as Ma and Liberman [3] point out, not all translators create translated pages that look like the original page. Moreover, structure-based matching is applicable only in corpora that include markup, and there are certainly multilingual collections on the Web and elsewhere that contain parallel text without structural tags. All these considerations motivate an approach to matching translations that pays attention to similarity of content, whether or not similarities of structure exist. In this section, we review systems of this kind: Bilingual Internet Text Search (BITS) [3], Parallel Text Identification (PTI) [4], and Dang's system [22].
In BITS, the similarity between two documents is measured by the algorithm in Figure 2.5: for each document A, the B which is most similar to A is found, and if the similarity between A and B is greater than a given threshold t, then A and B are declared a translation pair. The similarity between A and B is defined as

    sim(A, B) = Number of translation token pairs / Number of tokens in text A    (2.1)

In the experiment, Ma and Liberman use an English-German bilingual lexicon of 117,793 entries. The authors report 99.1% precision and 97.1% recall on a hand-picked set of 600 documents (half in each language) containing 240 translation pairs (as judged by humans).
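Equation (2.1) and the thresholded best-match step can be sketched as follows. The lexicon format and function names are illustrative, not the BITS code:

```python
def sim(a_tokens, b_tokens, lexicon):
    """Eq. (2.1): translation token pairs over the token count of A.
    `lexicon` maps a word of A's language to a set of translations."""
    if not a_tokens:
        return 0.0
    b_set = set(b_tokens)
    pairs = sum(1 for w in a_tokens if lexicon.get(w, set()) & b_set)
    return pairs / len(a_tokens)

def best_match(a_tokens, candidates, lexicon, t=0.3):
    """Find the candidate most similar to A; declare a translation
    pair only when the similarity exceeds the threshold t."""
    best = max(candidates, key=lambda b: sim(a_tokens, b, lexicon))
    return best if sim(a_tokens, best, lexicon) > t else None
```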
The PTI system (illustrated in Figure 2.6) crawls the Web to fetch parallel multilingual Web documents using a Web spider. To determine the parallelism between potential bilingual document pairs, two different modules are developed. A filename comparison module is used to check filename resemblance. A content analysis module is used to measure the degree of semantic similarity. It incorporates a novel content-based similarity scoring method for measuring the degree of parallelism for every potential document pair based on their semantic content
[Figure 2.5 appears here: pseudocode that, for each document A, computes sim(A, B) over candidate documents B and outputs the pair (A, B) when max_sim exceeds the threshold t.]
FIGURE 2.5: The algorithm of translation pairs finder [3]
using a bilingual wordlist. The results showed that the PTI system achieves a precision rate of 93% and a recall rate of 96% (180 correct instances in a total of 193 pairs extracted).
[Figure 2.6 appears here: the PTI pipeline, in which a Web spider feeds documents to a filename comparison module and a content analysis module, after which non-parallel documents are discarded.]
FIGURE 2.6: Architecture of the PTI system [4]
To our knowledge, there are few studies in this field related to Vietnamese. [22] built an English-Vietnamese parallel corpus based on content-based matching. Firstly, candidate web page pairs are found by using the features of sentence length and date. Then, they measure the similarity of content using a bilingual English-Vietnamese dictionary. However, this method only works for parallel pages that are good translations of each other, and they are required to be written in the same style. Moreover, using word-by-word translation will cause much ambiguity. Therefore, this approach is difficult to extend when the data increases, as well as when applying it to bilingual web sites with various styles.
Another instance of this approach is that, instead of using a bilingual dictionary, a simple word-based statistical machine translation system is used to translate texts in one language to the other. [26] uses this method to build an English-Chinese parallel corpus from a huge text collection of the Xinhua Web bilingual news corpora collected by the LDC¹. By adding the newly built parallel corpus to their existing corpus, they reported an increase in the translation quality of their word-based statistical machine translation in terms of word alignment. A bootstrapping approach [27] can also be applied to incrementally increase the number of both parallel sentences and bilingual lexical vocabulary.
The last version of STRAND [20] is another well-known web parallel text mining system. Its goal is to identify pairs of web pages that are mutual translations. The authors used the AltaVista search engine to search for multilingual web sites and generated candidate pairs based on manually created substitution rules. The heart of STRAND is a structural filtering process that relies on analysis of the pages' underlying HTML to determine a set of pair-specific structural values, and then uses those values to filter the candidate pairs. This system also proposes a new method that combines content-based and structure matching by using a cross-language similarity score as an additional parameter of the structure-based method. A translation lexicon is used to link tokens between pairs of parallel documents. A link is a pair (x, y) in which x is a word in language L1 and y is a word in L2. An example of two texts with links is illustrated in Figure 2.7. Using the results of MCBM², they defined the tsim translational similarity measure as

    tsim = Number of two-word links in best matching / Number of links in best matching    (2.2)
¹Linguistic Data Consortium, at http://www.ldc.upenn.edu/
²Problem of maximum cardinality bipartite matching
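The tsim computation in Eq. (2.2) can be sketched as follows: lexicon links form a bipartite graph between the two token sequences, a maximum cardinality matching is found (a simple augmenting-path version here), and tokens left unmatched count as links to NULL. This is an illustrative reconstruction, not the code of [20]:

```python
def max_matching(left, right, linked):
    """Maximum cardinality bipartite matching (Kuhn's augmenting
    paths). `linked(x, y)` says whether x may link to y."""
    match_r = {}  # index into `right` -> token of `left`

    def try_assign(x, seen):
        for j, y in enumerate(right):
            if linked(x, y) and j not in seen:
                seen.add(j)
                if j not in match_r or try_assign(match_r[j], seen):
                    match_r[j] = x
                    return True
        return False

    for x in left:
        try_assign(x, set())
    return match_r

def tsim(e_tokens, v_tokens, lexicon):
    """Eq. (2.2): two-word links over all links, where unmatched
    tokens are treated as links to NULL."""
    m = max_matching(e_tokens, v_tokens,
                     lambda x, y: y in lexicon.get(x, set()))
    two_word = len(m)
    total = two_word + (len(e_tokens) - two_word) + (len(v_tokens) - two_word)
    return two_word / total if total else 0.0
```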
[Figure 2.7 appears here: the English sentence “They plow the paddy fields and pull a cart” linked word by word to the Vietnamese “Họ cày ruộng lúa và kéo xe”, with unmatched tokens linked to NULL.]
FIGURE 2.7: An example of the two links in the text
In the experiment, approximately 400 pairs were evaluated by human annotators. STRAND produced fewer than 3500 English-Chinese pairs with a precision of 98% and a recall of 61%.
Among other systems, [19] proposed a method that combines length-based criteria with other features to detect parallel texts in a bilingual page. [28] proposed a similar approach: the author presents a system that automatically collects bilingual texts from the Internet, where the criteria for parallel text detection are based on the size, the HTML structures, and a word-by-word translation model.
In this chapter, we presented related works on mining parallel corpora from the Web. The content-based approach usually uses a bilingual dictionary to match word-word pairs in the two languages. Meanwhile, the structure-based approach relies on analysis of the HTML structure of the pages. In real implementations, both approaches are usually employed to get good performance. Generally, the structure-based
Trang 21The proposed approach
In this chapter, we introduce our proposed model, including the general architecture of the model and how structural features and content-based features are designed and estimated. We also present the classification modeling in our system.
In this work, our proposed approach combines content-based features and structure-based features of the HTML pages to extract parallel texts from the Web by using machine learning [20]. The machine learning algorithm used here is the Support Vector Machine (SVM). Figure 3.1 illustrates the general architecture of our system. As shown in the model, it includes the following tasks:
• Firstly, we use a crawler on the specified domains to extract bilingual English-Vietnamese pages, which are called raw data.
• Secondly, from the raw data, we create candidates of parallel web pages by using some thresholds on extracted features (content-based features and the date feature).
• Thirdly, we manually label these candidates to obtain training data. It means that we will obtain some pairs of parallel web pages which are assigned label 1, and some other pairs of web pages which are assigned label 0 (the detail of this task is presented in the experiment section).
FIGURE 3.1: Architecture of the Parallel Text Mining system
• Fourthly, we extract structural features and content-based features so that each web page pair can be represented as a vector of these features. This representation is required to fit a classification model.
• Finally, we use an SVM tool to train a classification system on this training data.
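The last two steps can be sketched as below. The thesis uses an SVM tool; here a tiny perceptron stands in as the linear classifier so the sketch stays self-contained, and the feature names are illustrative, not the thesis' actual feature set:

```python
def features(pair):
    """Turn one candidate page pair's raw measurements into a feature
    vector. The keys here are illustrative placeholders."""
    return [pair["structural_sim"], pair["cognate_sim"], pair["length_ratio"]]

def train_linear(xs, ys, epochs=50, lr=0.1):
    """Train a linear classifier by the perceptron rule (a stand-in
    for the SVM tool). The last weight is the bias."""
    w = [0.0] * (len(xs[0]) + 1)
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x + [1.0])) > 0 else 0
            if pred != y:
                for i, xi in enumerate(x + [1.0]):
                    w[i] += lr * (y - pred) * xi
    return w

def predict(w, x):
    """Label 1 means 'parallel pair', label 0 means 'not parallel'."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x + [1.0])) > 0 else 0
```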
Programs that collect web pages are referred to as Web crawlers or spiders. In general terms, the working of a Web crawler is as in Figure 3.2. A typical Web crawler, starting from a set of seed pages, locates new pages by parsing the downloaded pages and extracting the hyperlinks (in short, links) within. Extracted links are stored in a FIFO fetch queue for further
[Figure 3.2 appears here: the World Wide Web feeding downloaded Web pages into the crawler, which outputs text and metadata.]
FIGURE 3.2: Architecture of a standard Web crawler
retrieval. Crawling continues until the fetch queue becomes empty or a satisfactory number of pages have been downloaded.
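The crawl loop just described can be sketched as follows; `fetch` is a stand-in for real HTTP retrieval and link extraction:

```python
from collections import deque

def crawl(seeds, fetch, max_pages=100):
    """fetch(url) -> (text, links). Visit pages breadth-first from the
    seeds until the FIFO queue empties or the page budget is reached;
    returns the list of visited URLs."""
    queue = deque(seeds)   # FIFO fetch queue
    seen = set(seeds)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        text, links = fetch(url)
        visited.append(url)
        for link in links:  # extracted hyperlinks go to the back
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```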
In our work, bilingual English-Vietnamese web pages are collected by crawling the Web using a Web spider, as in [4]. To execute this process, our system uses Teleport-Pro¹ to retrieve web pages from remote web sites. Teleport-Pro is a tool designed to download documents on the Web via the HTTP and FTP protocols and store the extracted data on disk [3]. Note that we select the URLs on the specified hosts from the three news sites: BBC, VietnamPlus, and VOA News. For example, the URL on the BBC site for English is “http://www.bbc.co.uk”, and “http://www.bbc.co.uk/vietnamese/” for Vietnamese. Then, we use Teleport-Pro to download the HTML pages for obtaining the candidate web pages.
3.1.2 Content-based filtering module
The HTML pages are converted to plain text after they are retrieved from the remote web sites. Note that the original web pages usually contain useless user interface components such as JavaScript, Flash, etc. So, we use a simple script to clean them and extract only the text for content-based matching.
¹http://www.tenmax.com/teleport/pro/home.htm
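A minimal stand-in for such a cleaning script is shown below. It strips script and style blocks and all remaining tags with the standard library; real pages need more care (entities, encodings), so this is only a sketch:

```python
import re

def html_to_text(html):
    """Keep only the visible text of an HTML page for content-based
    matching."""
    # drop script/style blocks entirely, then remove remaining tags
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    html = re.sub(r"(?s)<[^>]+>", " ", html)
    return " ".join(html.split())
```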
[Figure 3.3 appears here: screenshots of a candidate page pair, a Vietnamese news article and its English counterpart, “Steel prices on the rise”, reporting a sharp increase in steel prices per tonne.]
FIGURE 3.3: An example of a candidate pair
As a common understanding, using content-based features we want to determine whether two pages are mutual translations. However, as [3] pointed out, not all translators create translated pages that look like the original page. Moreover, structure-based matching is applicable only in corpora that include markup, and there are certainly multilingual collections on the Web and elsewhere that contain parallel text without structural tags [20]. Many studies have used this approach to build a parallel corpus from the Web, such as [4, 22]. They use a bilingual dictionary to measure the similarity of the contents of two texts. However, this method can cause much ambiguity because a word usually has many translations. For English-Vietnamese, one word in English can correspond to multiple words in Vietnamese. To overcome this limitation, we propose two new methods in the content-based filtering module.
3.1.2.1 The method based on cognation
This method uses cognate information, which provides a cheap and reasonable resource. This proposal is based on the observation that a document usually contains some cognates, and if two documents are mutual translations then the cognates are usually kept the same in both of them. The cognates are words that are spelled similarly in two languages, or words that simply are not translated (e.g., abbreviations). For example, if the word “WTO” appears in an English text, it probably also appears as “WTO” in a Vietnamese text. Note that [30] also use cognates, but for sentence alignment. We divide the tokens that are considered cognates into three types as follows:
1. The abbreviations (e.g., “EU”, “WTO”),
2. The proper nouns in English (e.g., “Paris”), and
3. The numbers.
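Classifying tokens into these three types can be sketched with simple heuristics; the patterns below are illustrative rules, not the thesis' exact implementation:

```python
import re

def cognate_type(token):
    """Classify a token as one of the three cognate types, or None."""
    if re.fullmatch(r"\d+(?:[.,]\d+)?", token):
        return "number"
    if re.fullmatch(r"[A-Z]{2,}", token):        # e.g. EU, WTO
        return "abbreviation"
    if re.fullmatch(r"[A-Z][a-z]+", token):      # e.g. Paris
        return "proper_noun"
    return None

def extract_cognates(tokens):
    """Keep only the tokens considered cognates."""
    return [t for t in tokens if cognate_type(t)]
```

Note that a capitalized sentence-initial word would also match the proper-noun rule; a real system would need extra context to filter those out.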
The similarity is estimated by the ratio between the number of corresponding cognates in the two texts and the number of tokens in one text (e.g., the English text). Given a pair of texts (Etext, Vtext), where Etext stands for English and Vtext stands for Vietnamese, we respectively obtain the token sets of cognates T_E and T_V from Etext and Vtext. For a robust matching between cognates, we make some modifications of the original tokens:

• A number which is written as a sequence of letters in the English alphabet is converted into a real number. According to our observations, the units of numbers in English are often retained when translated into Vietnamese. So, we do not consider the case where the units are different (e.g., inch vs cm, pound vs kg, USD vs VND, etc.).

• We use a list which contains the corresponding names between English and Vietnamese. They include names of countries, continents, dates, etc. However, the names of countries in English can be translated into Vietnamese in different ways. Therefore, we only consider those names in English whose corresponding Vietnamese names have been published on the Wikipedia site².

Figure 3.5 is an example of two corresponding texts of English and Vietnamese. Having obtained T_E and T_V, we measure the similarity of cognates between Etext and Vtext by using the algorithm presented in Figure 3.6.
If sim_cognates(Etext, Vtext) is greater than a threshold, then the pair (Etext, Vtext) is a candidate. The sim_cognates(Etext, Vtext) is calculated as in formula (3.1).
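Since formula (3.1) itself is not reproduced in this excerpt, the following sketch implements only the verbal description above: count the cognates shared by the two texts and normalize by the token count of the English text. The function names and the pluggable `extract` step are assumptions for illustration:

```python
def sim_cognates(etext_tokens, vtext_tokens, extract):
    """Ratio of corresponding cognates to the number of tokens in the
    English text, per the description above. `extract(tokens)` returns
    the cognate tokens of a text."""
    if not etext_tokens:
        return 0.0
    t_e = extract(etext_tokens)
    t_v = set(extract(vtext_tokens))
    matched = sum(1 for t in t_e if t in t_v)
    return matched / len(etext_tokens)

def is_candidate(etext_tokens, vtext_tokens, extract, threshold=0.05):
    """Declare the pair a candidate when the similarity exceeds the
    threshold (the threshold value here is a placeholder)."""
    return sim_cognates(etext_tokens, vtext_tokens, extract) > threshold
```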