Extracting Parallel Texts from the Web
Le Quang Hung
Faculty of Information Technology Quynhon University, Vietnam Email: hungqnu@gmail.com
Le Anh Cuong
University of Engineering and Technology Vietnam National University, Hanoi Email: cuongla@vnu.edu.vn
Abstract— A parallel corpus is a valuable resource for several important applications of natural language processing, such as statistical machine translation, dictionary construction, and cross-language information retrieval. The Web is a huge resource of knowledge, which partly contains bilingual information in various kinds of web pages. It currently attracts many studies on building parallel corpora from Internet resources. However, obtaining a parallel corpus with high accuracy is still a challenge. This paper focuses on extracting parallel texts from bilingual web sites for the English-Vietnamese language pair. We first propose a new way of designing content-based features, and then combine them with structural features under a machine learning framework. In our experiment we obtain 88.2% precision for the extracted parallel texts.
I. INTRODUCTION

Parallel corpora have been used in many research areas of natural language processing. For example, parallel texts are used to connect vocabularies in cross-language information retrieval [5], [6], [9]. Moreover, extracting semantically equivalent components of parallel texts, such as words, phrases, and sentences, is useful for bilingual dictionary construction [1] and statistical machine translation [2], [7]. However, the available parallel corpora are not only relatively small in size but also unbalanced [15], even for the major languages. Along with the development of the Internet, the World Wide Web has become a huge database containing multi-language documents, and thus it is useful for bilingual text processing.
Up to now, several systems have been built for mining parallel corpora. These studies can be divided into three main kinds: content-based (CB), structure-based (SB), and combinations of both methods. In the CB approach, [3], [13], [14] use a bilingual dictionary to match word pairs across the two languages. Meanwhile, the SB approach [11], [12] relies on analyzing the HTML structure of the pages. Other studies, such as [4], [10], have combined the two methods to improve the performance of their systems.
Generally speaking, parallel web pages in a site have comparable structures and contents. Therefore, many of these studies focus on finding characteristics of HTML structures such as URL links, filenames, and HTML tags. PTMiner [3], for example, matches URL patterns to find candidate pairs and then filters them by external criteria such as size and date. Note that this criterion does not appear in most bilingual English-Vietnamese web sites. STRAND [10] has a similar approach to PTMiner, except that it handles the case where URL matching requires multiple substitutions. This system also proposes a new method that combines content-based and structure-based matching by using a cross-language similarity score as an additional parameter of the structure-based method.
To our knowledge, there are few studies in this field related to Vietnamese. [14] built an English-Vietnamese parallel corpus based on content-based matching. Firstly, candidate web page pairs are found by using features of sentence length and date. Then, the similarity of content is measured using a bilingual English-Vietnamese dictionary, and the decision whether two pages are parallel is made based on thresholds of this measure. Note that this system only searches for parallel pages that are good translations of each other and that are written in the same style. Moreover, using word-to-word translation causes much ambiguity. Therefore, this approach is difficult to extend when the data increases, as well as when it is applied to bilingual web sites with various styles.
In this paper, we aim to automatically extract English-Vietnamese parallel texts from bilingual news web sites. As encouraged by [10], we formulate this problem as a classification problem in order to utilize as much as possible the knowledge from structural information and the similarity of content. It is worth emphasizing that, differently from previous studies [3], [10], we use cognate information in place of word-to-word translation. From our observation, when a text is translated from one language to another, some special parts are kept unchanged or changed only a little. These parts are usually abbreviations, proper nouns, and numbers. In addition, we also use other content-based features, such as token lengths and paragraph lengths, which do not require any linguistic analysis either. It is worth noting that this approach does not need any dictionary, so we think it can be applied to other language pairs. Our experiment is conducted on web sites containing English-Vietnamese documents, including BBC (http://www.bbc.co.uk), VietnamPlus, and VOA News.
2010 Second International Conference on Knowledge and Systems Engineering
The rest of this paper is organized as follows. Section II presents the proposed model. Section III presents the experiment, in which we implement different feature sets. Finally, the conclusion is given in Section IV.
II. THE PROPOSED MODEL
In this paper, we follow an approach which combines content-based features and structure-based features of the HTML pages to extract parallel texts from the Web by using machine learning [10]. The machine learning algorithm used here is the Support Vector Machine (SVM). Figure 1 illustrates the general architecture of our proposed model. As shown in the model, it includes the following tasks:
• Firstly, we use a crawler on the specified domains to extract bilingual English-Vietnamese pages, which we call the raw data.
• Secondly, from the raw data, we create candidate pairs of parallel web pages by applying thresholds to some extracted features (content-based features and the publication date feature).
• Thirdly, we manually label these candidates to obtain training data. That is, we obtain some pairs of web pages which are assigned the label 1 (parallel) and others which are assigned the label 0 (the details of this task are presented in the experiment section).
• Fourthly, we extract structural features and content-based features so that each web page pair can be represented as a vector of these features. This representation is required to fit a classification model.
• Finally, we use an SVM tool to train a classification system on this training data. Then, given a pair of English-Vietnamese web pages at test time, the obtained classifier decides whether it is parallel or not.
A. Host Crawling
Bilingual English-Vietnamese web pages are collected by crawling the Web using a Web spider, as in [4]. A Web spider is a software tool that traverses a site to gather web pages by following the hyperlinks appearing in those pages. For this process, our system uses Teleport-Pro to retrieve web pages from remote web sites. Teleport-Pro is a tool designed to download documents from the Web via the HTTP and FTP protocols and store the extracted data on disk [15]. Note that we select the URLs on the specified hosts from the three news sites: BBC, VietnamPlus, and VOA News. For example, the URLs on the BBC site are "http://news.bbc.co.uk/english/" for English and "http://www.bbc.co.uk/vietnamese/" for Vietnamese. We then use Teleport-Pro to download the HTML pages to obtain the candidate web pages.
B. Content-Based Filtering Module
Using content-based features, we want to determine whether two pages are mutual translations. As [15] pointed out, not all translators create translated pages that look like the original page. Moreover, SB matching is applicable only in corpora that include markup, and there are certainly multilingual collections on the Web and elsewhere that contain parallel text without structural tags [10]. Many studies have used the content-based approach to build a parallel corpus from the Web, e.g. [4], [14]. They use a bilingual dictionary to measure the similarity of the contents of two texts. However, this method can cause much ambiguity because a word may have many translations; for English-Vietnamese, one English word can correspond to multiple Vietnamese words. In this paper, we propose a different approach which relies on a cheap and reasonably effective resource. This proposal is based on the observation that a document usually contains some cognates, and that if two documents are mutual translations then the cognates are usually kept the same in both of them¹. Note that [8] also use cognates, but for sentence alignment. From our observation, we divide the tokens that are considered cognates into the following three kinds:
1) Abbreviations (e.g. "EU", "WTO")
2) Proper nouns in English (e.g. "Vietnam", "Paris")
3) Numbers (e.g. "10", "2010")
Now we can design a feature for measuring content similarity based on cognates. This feature is computed as the ratio between the number of corresponding cognates in the two texts and the number of tokens in one text (e.g. the English text).
Given a pair of texts (A, B), where A is in English and B is in Vietnamese, we obtain the cognate token sets T1 and T2 from A and B, respectively. For a robust matching between cognates, we make some modifications to the original tokens:
• A number which is written as a sequence of letters in the English alphabet is converted into a real number. According to our observations, the units of numbers in English are often retained when translated into Vietnamese, so we do not consider cases where the units differ (e.g. inch vs. cm, pound vs. kg, USD vs. VND, etc.).

¹ Cognates in linguistics are words that have a common etymological origin (http://en.wikipedia.org/wiki/Cognate)
• We use a list of corresponding names between English and Vietnamese, including names of countries, continents, dates, etc. However, the names of countries in English can be translated into Vietnamese in different ways. Therefore, we only consider those English names whose corresponding Vietnamese names have been published on the Wikipedia site². For example, "Mexico" in English and "Mêhicô" or "Mễ Tây Cơ" in Vietnamese are treated as the same.
The following is an example of two corresponding texts in English and Vietnamese:
• Vietnam and Italy through three cooperation programmes
beginning in 1998 have so far signed more than 60
projects on joint scientific research Of the figure, 40
projects have been carried out and brought good results.
• Từ 1998, đến nay, Việt Nam và Italy đã kí kết hơn 60 dự
án hợp tác nghiên cứu chung, có khoảng 40 dự án được
triển khai thực hiện và đạt được kết quả tích cực.
Scanning these texts, we obtain T1 = {"Vietnam", "Italy", "1998", "60", "40"} and T2 = {"1998", "Việt Nam", "Italy", "60", "40"}. We measure the similarity of cognates between A and B using the algorithm presented in Figure 2. If sim_cognates(A, B) is greater than a threshold, then the pair (A, B) is a candidate. The value sim_cognates(A, B) is calculated as in formula (1):
sim_cognates(A, B) = count / (Number of tokens in T1)    (1)

where count is the number of cognates in T1 that are matched in T2.
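As an illustration, formula (1) together with the cognate extraction can be sketched in Python. This is a simplified sketch: matching is exact string equality on single tokens, so the multiword-name and number normalizations described above are omitted, and all identifiers are ours:

```python
import re

def cognate_tokens(text):
    """Collect tokens of the three cognate kinds: abbreviations,
    English proper nouns, and numbers (single tokens only)."""
    out = []
    for tok in re.findall(r"\w+", text):
        if re.fullmatch(r"[A-Z]{2,}", tok):      # abbreviation, e.g. "WTO"
            out.append(tok)
        elif re.fullmatch(r"[A-Z][a-z]+", tok):  # proper noun, e.g. "Italy"
            out.append(tok)
        elif re.fullmatch(r"\d+", tok):          # number, e.g. "1998"
            out.append(tok)
    return out

def sim_cognates(english_text, vietnamese_text):
    """Formula (1): matched cognates divided by the number of
    cognate tokens in the English text."""
    t1 = cognate_tokens(english_text)
    t2 = set(cognate_tokens(vietnamese_text))
    if not t1:
        return 0.0
    count = sum(1 for tok in t1 if tok in t2)
    return count / len(t1)

print(sim_cognates("Vietnam and Italy signed 60 projects in 1998",
                   "Việt Nam và Italy đã kí 60 dự án từ 1998"))  # → 0.75
```

Note that "Vietnam" fails to match here because "Việt Nam" is two tokens; the name-list normalization described above is what recovers such matches.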
In addition to cognates, we observe that text length and number of paragraphs also provide evidence for measuring content similarity between two texts. Parallel texts usually have similar text lengths and numbers of paragraphs. Therefore, given a pair of texts, we design three features as follows:
• The first feature estimates the cognate-based similarity. It is computed by formula (1).
• The second feature estimates the similarity of text lengths. A way to filter out wrong pairs is to compare the lengths of the two texts in characters. We set a reasonable threshold on the ratio between the two lengths so that potential candidates are kept.
• The third feature estimates the ratio of the numbers of paragraphs in the two texts. In our opinion, two parallel texts often have similar numbers of paragraphs, so a feature representing this criterion is necessary.

² http://vi.wikipedia.org/wiki/Danh_sách_quốc_gia
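The two length-based features can be sketched as symmetric min/max ratios (the exact ratio form and the blank-line paragraph delimiter are our assumptions; the paper only states that a threshold on the ratio is used):

```python
def content_features(text_a, text_b):
    """Two of the three content-based features: character-length ratio
    and paragraph-count ratio, both in (0, 1], with 1 meaning equal.
    Paragraphs are assumed to be separated by blank lines."""
    length_ratio = min(len(text_a), len(text_b)) / max(len(text_a), len(text_b))
    paras_a = [p for p in text_a.split("\n\n") if p.strip()]
    paras_b = [p for p in text_b.split("\n\n") if p.strip()]
    paragraph_ratio = min(len(paras_a), len(paras_b)) / max(len(paras_a), len(paras_b))
    return {"length_ratio": length_ratio, "paragraph_ratio": paragraph_ratio}
```

The cognate-based feature of formula (1) completes the three-feature vector.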
C. Structure Analysis Module
Besides finding candidate pairs based on the content of the texts, the similarity of structure of the HTML pages also provides useful information for determining whether a pair of web pages is a mutual translation. This method uses the hypothesis that parallel web pages are presented with similar structures. Note that this approach does not require linguistic knowledge. For the structural features, we follow the approach presented in [10]. The structural analysis module is implemented in two steps. In the first step, both documents in the candidate pair are analyzed by a markup analyzer that acts as a transducer, producing a linear sequence containing three kinds of tokens [10]: [START:element_label], [END:element_label], [CHUNK:length]. In the second step, we align the linearized sequences using a dynamic programming algorithm.
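The first step, HTML linearization, can be sketched with Python's standard html.parser (the token spelling follows [10]; the class name is ours):

```python
from html.parser import HTMLParser

class Linearizer(HTMLParser):
    """Transducer producing the linear sequence of
    [START:label], [END:label] and [CHUNK:length] tokens."""
    def __init__(self):
        super().__init__()
        self.seq = []
    def handle_starttag(self, tag, attrs):
        self.seq.append(f"[START:{tag.upper()}]")
    def handle_endtag(self, tag):
        self.seq.append(f"[END:{tag.upper()}]")
    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only chunks
            self.seq.append(f"[CHUNK:{len(text)}]")

lin = Linearizer()
lin.feed("<html><p>Hello world</p></html>")
print(lin.seq)
# → ['[START:HTML]', '[START:P]', '[CHUNK:11]', '[END:P]', '[END:HTML]']
```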
After alignment, we compute four scalar values that characterize the quality of the alignment [10]:
• dp: the difference percentage, indicating non-shared material.
• n: the number of aligned non-markup text chunks of unequal length.
• r: the correlation of the lengths of the aligned non-markup chunks.
• p: the significance level of the correlation r.
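Given an alignment, the first three values can be computed directly (the input representation below is our assumption; p, the significance of r, additionally requires a statistical test and is omitted here):

```python
import math

def alignment_stats(aligned_chunks, total_tokens, unaligned_tokens):
    """dp, n, r from an alignment: aligned_chunks is a list of
    (length_a, length_b) pairs of aligned non-markup chunks."""
    dp = unaligned_tokens / total_tokens             # difference percentage
    n = sum(1 for a, b in aligned_chunks if a != b)  # unequal-length pairs
    xs = [a for a, _ in aligned_chunks]
    ys = [b for _, b in aligned_chunks]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    r = cov / (sx * sy) if sx > 0 and sy > 0 else 0.0  # Pearson correlation
    return dp, n, r
```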
In addition, we observe that on a bilingual news site, a page which is the translation of an original page is usually created within a short period after the original was published. Therefore, using this feature we can eliminate many pairs which are not parallel. For example, on bilingual news sites such as BBC and VOA, the Vietnamese pages are published on the same day as, or one day later than, the corresponding English pages [14]. To extract this information, we analyzed the HTML tags and then grouped this feature (publication date) into the structural feature set. Note that this information is extracted from different HTML formats (e.g. the META tag on the BBC site: <META name = "OriginalPublicationDate" content = "2009/11/22 12:29:48"/>, the SPAN tag on the VietnamPlus site: <SPAN id = "ctl00_mContent_lbDate" class = "timeDate">10/04/2009</SPAN>, etc.).
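Extracting the date from these two formats can be sketched with regular expressions (the patterns match exactly the two tag shapes shown above; other sites would need their own patterns):

```python
import re

def extract_pub_date(html):
    """Return the publication date string from a BBC META tag or a
    VietnamPlus SPAN tag, or None if neither format is found."""
    m = re.search(r'name\s*=\s*"OriginalPublicationDate"\s+'
                  r'content\s*=\s*"(\d{4}/\d{2}/\d{2})', html)
    if m:
        return m.group(1)
    m = re.search(r'id\s*=\s*"ctl00_mContent_lbDate"[^>]*>(\d{2}/\d{2}/\d{4})',
                  html)
    if m:
        return m.group(1)
    return None

print(extract_pub_date('<META name = "OriginalPublicationDate" '
                       'content = "2009/11/22 12:29:48"/>'))  # → 2009/11/22
```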
D. Classification Modeling
From the content-based and structural analysis of each pair of web pages, we obtain features divided into two categories: content features and structural features. The content features include sim_cognates(A, B), the text length, and the number of paragraphs. The structural features include dp, n, r, p, and the publication date. It is now easy to formulate the task as a classification problem. Each candidate pair of web pages is represented by a vector of these features. We label a pair 1 if it is parallel and 0 otherwise, and in this way we obtain the training data. In our system, we use a support vector machine algorithm to train the classification system. For a new pair of web pages, we first extract the features to represent it as a vector. This vector goes through the classification system, which returns 1 or 0, i.e. the answer to whether the pair is parallel or not.
III. EXPERIMENT
A. Data Preparation
We have explored several bilingual English-Vietnamese news sites on the World Wide Web. There are a few sites with high translation quality. In this system, we experiment with 94,323 pages downloaded from three web sites: 37,665 pages from BBC³, 12,553 pages from VietnamPlus⁴, and 44,105 pages from VOA News⁵.
Firstly, we perform host crawling on the specified domains, and the HTML pages are downloaded. All collected data is analyzed by the CB module to filter candidate pairs. We used the following thresholds:

sim_cognates(A, B) > 0.5,
publication date difference ≤ 1 (day).
As a result, we excluded over 90% of the pairs, which are not considered candidates. Consequently, we obtained 1,170 candidate pairs for determining whether each of them is parallel or not. Next, all data obtained from the content filtering module goes into the structure module to extract the designed features. We then labeled each candidate pair: a pair is labeled 1 if it is parallel, and 0 otherwise. There are 433 pairs labeled 1 and 737 pairs labeled 0 among these 1,170 candidate pairs. After that, we put this data into the format <label> <index1>:<value1> <index2>:<value2> ..., which is suitable for the LIBSVM tool⁶.
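Serializing a labeled feature vector into this sparse LIBSVM format is straightforward (the feature ordering and the helper name are ours):

```python
def to_libsvm_line(label, features):
    """Format one candidate pair as a LIBSVM training line:
    <label> <index1>:<value1> <index2>:<value2> ...
    `features` is the ordered list of feature values for the pair."""
    parts = [str(label)] + [f"{i}:{v}" for i, v in enumerate(features, start=1)]
    return " ".join(parts)

# e.g. cognate sim, length ratio, paragraph ratio, dp, n, r, date difference
print(to_libsvm_line(1, [0.72, 0.91, 1.0, 0.05, 2, 0.98, 0]))
# → 1 1:0.72 2:0.91 3:1.0 4:0.05 5:2 6:0.98 7:0
```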
B. Experimental Results
We conduct a 5-fold cross-validation experiment; each fold has 234 test items and 936 training items. To investigate the effectiveness of different kinds of features, we design three feature sets: a set containing only content-based (CB) features, a set containing only structure-based (SB) features, and a set which includes both kinds of features (i.e. CB and SB features).
We also use the three well-known evaluation measures, defined as follows:
Precision = (No. of pairs labeled 1 that are true) / (Total no. of pairs labeled 1 in the output data)    (2)

Recall = (No. of pairs labeled 1 that are true) / (Total no. of pairs labeled 1 in the test data)    (3)
³ http://news.bbc.co.uk
⁴ http://en.vietnamplus.vn, http://www.vietnamplus.vn
⁵ http://www.voanews.com
⁶ http://www.csie.ntu.edu.tw/~cjlin/libsvm/
F-Score = (2 × Precision × Recall) / (Precision + Recall)    (4)
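Applied to predicted and gold labels over the candidate pairs, formulas (2)-(4) amount to:

```python
def evaluate(predicted, gold):
    """Precision, recall and F-score for the positive (parallel) label,
    following formulas (2)-(4)."""
    tp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 1)
    predicted_pos = sum(1 for p in predicted if p == 1)  # labeled 1 in output
    actual_pos = sum(1 for g in gold if g == 1)          # labeled 1 in test data
    precision = tp / predicted_pos
    recall = tp / actual_pos
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```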
It is worth noting that, to compare our approach with previous approaches using content-based features, we also conducted an experiment like that in [15]. That study measures the similarity of content based on aligning word translations between the two texts. Here, we use a bilingual English-Vietnamese dictionary to compute a content-based similarity score. For each pair of texts (or web pages), the similarity score is defined as follows:

sim(A, B) = (Number of translation token pairs) / (Number of tokens in text A)    (5)
With this experiment, we obtained the results shown in Table I.
TABLE I
EVALUATING CONTENT-BASED MATCHING (USING THE BILINGUAL DICTIONARY TO MATCH WORD-WORD PAIRS)

          Precision  Recall  F-Score
Fold 1      0.688    0.484    0.568
Fold 2      0.647    0.478    0.550
Fold 3      0.643    0.548    0.591
Fold 4      0.601    0.569    0.584
Fold 5      0.682    0.528    0.595
Average     0.652    0.521    0.578
Table II shows our experimental results using the content-based features, Table III shows the results using the structure-based features, and Table IV contains the results obtained by combining both kinds of features.
TABLE II
EVALUATING CONTENT-BASED MATCHING (USING THE FEATURES EXTRACTED BY THE CONTENT-BASED FILTERING MODULE)

          Precision  Recall  F-Score
Fold 1      0.831    0.907    0.867
Fold 2      0.823    0.864    0.843
Fold 3      0.810    0.836    0.823
Fold 4      0.878    0.765    0.818
Fold 5      0.931    0.803    0.862
Average     0.855    0.835    0.843
TABLE III
EVALUATING STRUCTURE-BASED MATCHING

          Precision  Recall  F-Score
Fold 1      0.409    0.620    0.493
Fold 2      0.518    0.614    0.562
Fold 3      0.397    0.614    0.482
Fold 4      0.451    0.763    0.567
Fold 5      0.444    0.654    0.529
Average     0.444    0.653    0.529
TABLE IV
COMBINING STRUCTURAL AND CONTENT-BASED MATCHING

          Precision  Recall  F-Score
Fold 1      0.873    0.817    0.844
Fold 2      0.862    0.842    0.852
Fold 3      0.869    0.879    0.874
Fold 4      0.904    0.817    0.858
Fold 5      0.904    0.733    0.810
Average     0.882    0.817    0.848

It is worth noting that for this task (extracting a parallel corpus), precision is the most important criterion for evaluating the effectiveness of the system. According to the results shown in the above tables, the precision obtained with the CB feature set (85.5%) is much higher than with the SB feature set (44.4%). Our approach to extracting CB features is also much better than the approach of [15], which obtains only 65.2% precision. The combination of both kinds of features gives the best result, with a precision of 88.2%.
These results show that the CB features we proposed are very effective. They also suggest that if we are not sure about the structural correspondence between two web pages, we can use the content-based features alone.
IV. CONCLUSION

This paper has presented our work on extracting a parallel corpus from the Web for the English-Vietnamese language pair. We have proposed a new approach for measuring the content similarity of two pages which does not require deep linguistic analysis. We have utilized both structural features and content-based features under a machine learning framework. The obtained results show that the proposed content-based features provide the major information for determining whether a pair of web pages is parallel. In addition, our approach can be applied to other language pairs because the features used in the proposed model are language-independent. In the future, we will extend our work to extracting smaller parallel components such as paragraphs, sentences, or phrases. This will also be interesting in cases where the quality of translation between bilingual web pages is not good.
Acknowledgment
This work is supported by NAFOSTED (Vietnam’s National
Foundation for Science and Technology Development)
REFERENCES

[1] Akira Kumano and Hideki Hirakawa. 1994. Building an MT dictionary from parallel texts based on linguistic and statistical information. In Proc. 15th COLING, pages 76-81.
[2] Brown, P., Cocke, J., Della Pietra, S., Della Pietra, V., Jelinek, F., Mercer, R., Roossin, P. (1990). "A statistical approach to machine translation". Computational Linguistics, 16(2), 79-85.
[3] Chen, J., Nie, J.Y. 2000. Automatic construction of a parallel English-Chinese corpus for cross-language information retrieval. In Proc. ANLP, pages 21-28, Seattle.
[4] Chen, J., Chau, R. and Yeh, C.-H. (2004). Discovering Parallel Text from the World Wide Web. In Proc. Australasian Workshop on Data Mining and Web Intelligence (DMWI2004).
[5] Davis, M., Dunning, T. (1995). "A TREC evaluation of query translation methods for multi-lingual text retrieval". Fourth Text Retrieval Conference (TREC-4). NIST.
[6] Martin Volk, Spela Vintar, Paul Buitelaar. "Ontologies in Cross-Language Information Retrieval". Wissensmanagement 2003: 43-50.
[7] Melamed, I. D. (1998). "Word-to-word models of translation equivalence". IRCS Technical Report 98-08, University of Pennsylvania.
[8] Michel Simard, George F. Foster, Pierre Isabelle. "Using Cognates to Align Sentences in Bilingual Corpora".
[9] Oard, D. W. (1997). "Cross-language text retrieval research in the USA". Third DELOS Workshop, European Research Consortium for Informatics and Mathematics.
[10] P. Resnik and N. A. Smith. 2003. The Web as a Parallel Corpus. Computational Linguistics, 29(3):349-380.
[11] Resnik, Philip. 1998. Parallel strands: A preliminary investigation into mining the Web for bilingual text. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98), Langhorne, PA, October 28-31.
[12] Resnik, Philip. 1999. Mining the Web for bilingual text. In Proceedings of the 37th Annual Meeting of the ACL, pages 527-534, College Park, MD, June.
[13] Takehito Utsuro, Hiroshi Ikeda, Masaya Yamane, Yuji Matsumoto, and Makoto Nagao. 1994. Bilingual text matching using bilingual dictionary and statistics. In Proc. 15th COLING, pages 1076-1082.
[14] Van B. Dang, Ho Bao-Quoc. 2007. Automatic Construction of English-Vietnamese Parallel Corpus through Web Mining. In Proceedings of the 5th IEEE International Conference on Computer Science - Research, Innovation and Vision of the Future (RIVF'2007), Hanoi, Vietnam.
[15] Xiaoyi Ma, Mark Liberman. 1999. BITS: A method for bilingual text search over the Web. Machine Translation Summit VII, September 1999.