Extracting Parallel Texts from the Web
Le Quang Hung
Faculty of Information Technology Quynhon University, Vietnam Email: hungqnu@gmail.com
Le Anh Cuong
University of Engineering and Technology Vietnam National University, Hanoi Email: cuongla@vnu.edu.vn
Abstract— A parallel corpus is a valuable resource for several important applications of natural language processing, such as statistical machine translation, dictionary construction, and cross-language information retrieval. The Web is a huge resource of knowledge, which partly contains bilingual information in various kinds of web pages. It currently attracts many studies on building parallel corpora from Internet resources. However, obtaining a parallel corpus with high accuracy is still a challenge. This paper focuses on extracting parallel texts from bilingual web sites for the English-Vietnamese language pair. We first propose a new way of designing content-based features, and then combine them with structural features under a machine learning framework. In our experiment we obtain 88.2% precision for the extracted parallel texts.
I. INTRODUCTION

Parallel corpora have been used in many research areas of natural language processing. For example, parallel texts are used to connect vocabularies in cross-language information retrieval [5], [6], [9]. Moreover, extracting semantically equivalent components of parallel texts, such as words, phrases, and sentences, is useful for bilingual dictionary construction [1] and statistical machine translation [2], [7]. However, the available parallel corpora are not only relatively small in size but also unbalanced [15], even for the major languages. Along with the development of the Internet, the World Wide Web has become a huge database containing multi-language documents, and thus it is useful for bilingual text processing.
Up to now, several systems have been built for mining parallel corpora. These studies can be divided into three main kinds: content-based (CB), structure-based (SB), and combinations of both methods. In the CB approach, [3], [13], [14] use a bilingual dictionary to match word pairs across the two languages. Meanwhile, the SB approach [11], [12] relies on analyzing the HTML structure of the pages. Other studies, such as [4], [10], have combined the two methods to improve the performance of their systems.
Generally speaking, parallel web pages in a site have comparable structures and contents. Therefore, many of these studies focus on finding characteristics of HTML structures such as URL links, filenames, and HTML tags. PTMiner [3], for example, matches URL patterns to find candidate pairs and then filters them by external criteria such as size and date. Note that this criterion does not appear in most bilingual English-Vietnamese web sites. STRAND [10] has a similar approach to PTMiner, except that it handles the case where URL matching requires multiple substitutions. This system also proposes a new method that combines content-based and structure-based matching by using a cross-language similarity score as an additional parameter of the structure-based method.
To our knowledge, there are few studies in this field related to Vietnamese. [14] built an English-Vietnamese parallel corpus based on content-based matching. Firstly, candidate web page pairs are found by using features of sentence length and date. Then, the similarity of content is measured using a bilingual English-Vietnamese dictionary, and the decision whether two pages are parallel is made based on thresholds of this measure. Note that this system only searches for parallel pages that are good translations of each other and that are written in the same style. Moreover, using word-to-word translation causes much ambiguity. Therefore, this approach is difficult to extend when the data increases, as well as when it is applied to bilingual web sites with various styles.
In this paper, we aim to automatically extract English-Vietnamese parallel texts from bilingual news web sites. As encouraged by [10], we formulate this problem as a classification problem in order to utilize as much as possible the knowledge from structural information and the similarity of content. It is worth emphasizing that, differently from previous studies [3], [10], we use cognate information in place of word-to-word translation. From our observation, when a text is translated from one language to another, some special parts are kept unchanged or changed only a little. These parts are usually abbreviations, proper nouns, and numbers. In addition, we also use other content-based features, such as token lengths and paragraph lengths, which do not require any linguistic analysis either. It is worth noting that this approach does not need any dictionary, so we think it can be applied to other language pairs. Our experiment is conducted on web sites containing English-Vietnamese documents, including BBC (http://www.bbc.co.uk), VietnamPlus, and VOA News.
2010 Second International Conference on Knowledge and Systems Engineering
The rest of this paper is organized as follows. Section II presents the proposed model. Section III presents the experiment, in which we implement different feature sets. Finally, the conclusion is given in Section IV.
II. THE PROPOSED MODEL
In this paper, we follow an approach which combines content-based features and structure-based features of the HTML pages to extract parallel texts from the Web by using machine learning [10]. The machine learning algorithm used here is the Support Vector Machine (SVM). Figure 1 illustrates the general architecture of our proposed model. As shown in the model, it includes the following tasks:
• Firstly, we use a crawler on the specified domains to extract bilingual English-Vietnamese pages, which we call the raw data.
• Secondly, from the raw data, we create candidate pairs of parallel web pages by applying thresholds to some extracted features (content-based features and the publication date feature).
• Thirdly, we manually label these candidates to obtain training data. That is, we obtain some pairs of web pages which are assigned the label 1 (parallel) and others which are assigned the label 0 (the details of this task are presented in the experiment section).
• Fourthly, we extract structural features and content-based features so that each web page pair can be represented as a vector of these features. This representation is required to fit a classification model.
• Finally, we use an SVM tool to train a classification system on this training data. Then, given a pair of English-Vietnamese web pages at test time, the obtained classifier decides whether it is parallel or not.
A. Host Crawling
Bilingual English-Vietnamese web pages are collected by crawling the Web using a Web spider, as in [4]. A Web spider is a software tool that traverses a site to gather web pages by following the hyperlinks appearing in those pages. For this process, our system uses Teleport-Pro to retrieve web pages from remote web sites. Teleport-Pro is a tool designed to download documents from the Web via the HTTP and FTP protocols and store the extracted data on disk [15]. Note that we select the URLs on the specified hosts from the three news sites: BBC, VietnamPlus, and VOA News. For example, the URLs on the BBC site are "http://news.bbc.co.uk/english/" for English and "http://www.bbc.co.uk/vietnamese/" for Vietnamese. We then use Teleport-Pro to download the HTML pages to obtain the candidate web pages.
B. Content-Based Filtering Module
Using content-based features, we want to determine whether two pages are mutual translations. As [15] pointed out, not all translators create translated pages that look like the original page. Moreover, SB matching is applicable only in corpora that include markup, and there are certainly multilingual collections on the Web and elsewhere that contain parallel text without structural tags [10]. Many studies have used the content-based approach to build a parallel corpus from the Web, e.g. [4], [14]. They use a bilingual dictionary to measure the similarity of the contents of two texts. However, this method can cause much ambiguity because a word may have many translations; for English-Vietnamese, one English word can correspond to multiple Vietnamese words. In this paper, we propose a different approach which relies on a cheap and reasonably effective resource. This proposal is based on the observation that a document usually contains some cognates, and that if two documents are mutual translations then the cognates are usually kept the same in both of them¹. Note that [8] also use cognates, but for sentence alignment. From our observation, we divide the tokens that are considered cognates into the following three kinds:
1) Abbreviations (e.g. "EU", "WTO")
2) Proper nouns in English (e.g. "Vietnam", "Paris")
3) Numbers (e.g. "10", "2010")
Now we can design a feature for measuring content similarity based on cognates. This feature is computed as the ratio between the number of corresponding cognates in the two texts and the number of tokens in one text (e.g. the English text).
Given a pair of texts (A, B), where A is in English and B is in Vietnamese, we obtain the cognate token sets T1 and T2 from A and B, respectively. For a robust matching between cognates, we make some modifications to the original tokens:
• A number which is written as a sequence of letters in the English alphabet is converted into a real number. According to our observations, the units of numbers in English are often retained when translated into Vietnamese, so we do not consider cases where the units differ (e.g. inch vs. cm, pound vs. kg, USD vs. VND, etc.).

¹ Cognates in linguistics are words that have a common etymological origin (http://en.wikipedia.org/wiki/Cognate)
• We use a list of corresponding names between English and Vietnamese, including names of countries, continents, dates, etc. However, the names of countries in English can be translated into Vietnamese in different ways. Therefore, we only consider those English names whose corresponding Vietnamese names have been published on the Wikipedia site². For example, "Mexico" in English and "Mêhicô" or "Mễ Tây Cơ" in Vietnamese are treated as the same.
The following is an example of two corresponding texts in English and Vietnamese:
• Vietnam and Italy through three cooperation programmes
beginning in 1998 have so far signed more than 60
projects on joint scientific research Of the figure, 40
projects have been carried out and brought good results.
• Từ 1998, đến nay, Việt Nam và Italy đã kí kết hơn 60 dự
án hợp tác nghiên cứu chung, có khoảng 40 dự án được
triển khai thực hiện và đạt được kết quả tích cực.
Scanning these texts, we obtain T1 = {"Vietnam", "Italy", "1998", "60", "40"} and T2 = {"1998", "Việt Nam", "Italy", "60", "40"}. We measure the similarity of cognates between A and B using the algorithm presented in Figure 2. If sim_cognates(A, B) is greater than a threshold, then the pair (A, B) is a candidate. The value sim_cognates(A, B) is calculated as in formula (1):
sim_cognates(A, B) = count / (Number of tokens in T1)    (1)

where count is the number of cognates in T1 that are matched in T2.
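As an illustration, formula (1) together with the cognate extraction can be sketched in Python. This is a simplified sketch: matching is exact string equality on single tokens, so the multiword-name and number normalizations described above are omitted, and all identifiers are ours:

```python
import re

def cognate_tokens(text):
    """Collect tokens of the three cognate kinds: abbreviations,
    English proper nouns, and numbers (single tokens only)."""
    out = []
    for tok in re.findall(r"\w+", text):
        if re.fullmatch(r"[A-Z]{2,}", tok):      # abbreviation, e.g. "WTO"
            out.append(tok)
        elif re.fullmatch(r"[A-Z][a-z]+", tok):  # proper noun, e.g. "Italy"
            out.append(tok)
        elif re.fullmatch(r"\d+", tok):          # number, e.g. "1998"
            out.append(tok)
    return out

def sim_cognates(english_text, vietnamese_text):
    """Formula (1): matched cognates divided by the number of
    cognate tokens in the English text."""
    t1 = cognate_tokens(english_text)
    t2 = set(cognate_tokens(vietnamese_text))
    if not t1:
        return 0.0
    count = sum(1 for tok in t1 if tok in t2)
    return count / len(t1)

print(sim_cognates("Vietnam and Italy signed 60 projects in 1998",
                   "Việt Nam và Italy đã kí 60 dự án từ 1998"))  # → 0.75
```

Note that "Vietnam" fails to match here because "Việt Nam" is two tokens; the name-list normalization described above is what recovers such matches.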
In addition to cognates, we observe that text length and number of paragraphs also provide evidence for measuring content similarity between two texts. Parallel texts usually have similar text lengths and numbers of paragraphs. Therefore, given a pair of texts, we design three features as follows:
• The first feature estimates the cognate-based similarity. It is computed by formula (1).
• The second feature estimates the similarity of text lengths. A way to filter out wrong pairs is to compare the lengths of the two texts in characters. We set a reasonable threshold on the ratio between the two lengths so that potential candidates are kept.
• The third feature estimates the ratio of the numbers of paragraphs in the two texts. In our opinion, two parallel texts often have similar numbers of paragraphs, so a feature representing this criterion is necessary.

² http://vi.wikipedia.org/wiki/Danh_sách_quốc_gia
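The two length-based features can be sketched as symmetric min/max ratios (the exact ratio form and the blank-line paragraph delimiter are our assumptions; the paper only states that a threshold on the ratio is used):

```python
def content_features(text_a, text_b):
    """Two of the three content-based features: character-length ratio
    and paragraph-count ratio, both in (0, 1], with 1 meaning equal.
    Paragraphs are assumed to be separated by blank lines."""
    length_ratio = min(len(text_a), len(text_b)) / max(len(text_a), len(text_b))
    paras_a = [p for p in text_a.split("\n\n") if p.strip()]
    paras_b = [p for p in text_b.split("\n\n") if p.strip()]
    paragraph_ratio = min(len(paras_a), len(paras_b)) / max(len(paras_a), len(paras_b))
    return {"length_ratio": length_ratio, "paragraph_ratio": paragraph_ratio}
```

The cognate-based feature of formula (1) completes the three-feature vector.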
C. Structure Analysis Module
Besides finding candidate pairs based on the content of the texts, the similarity of structure of the HTML pages also provides useful information for determining whether a pair of web pages is a mutual translation. This method uses the hypothesis that parallel web pages are presented with similar structures. Note that this approach does not require linguistic knowledge. For the structural features, we follow the approach presented in [10]. The structural analysis module is implemented in two steps. In the first step, both documents in the candidate pair are analyzed by a markup analyzer that acts as a transducer, producing a linear sequence containing three kinds of tokens [10]: [START:element_label], [END:element_label], [CHUNK:length]. In the second step, we align the linearized sequences using a dynamic programming algorithm.
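The first step, HTML linearization, can be sketched with Python's standard html.parser (the token spelling follows [10]; the class name is ours):

```python
from html.parser import HTMLParser

class Linearizer(HTMLParser):
    """Transducer producing the linear sequence of
    [START:label], [END:label] and [CHUNK:length] tokens."""
    def __init__(self):
        super().__init__()
        self.seq = []
    def handle_starttag(self, tag, attrs):
        self.seq.append(f"[START:{tag.upper()}]")
    def handle_endtag(self, tag):
        self.seq.append(f"[END:{tag.upper()}]")
    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only chunks
            self.seq.append(f"[CHUNK:{len(text)}]")

lin = Linearizer()
lin.feed("<html><p>Hello world</p></html>")
print(lin.seq)
# → ['[START:HTML]', '[START:P]', '[CHUNK:11]', '[END:P]', '[END:HTML]']
```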
After alignment, we compute four scalar values that characterize the quality of the alignment [10]:
• dp: the difference percentage, indicating non-shared material.
• n: the number of aligned non-markup text chunks of unequal length.
• r: the correlation of the lengths of the aligned non-markup chunks.
• p: the significance level of the correlation r.
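Given an alignment, the first three values can be computed directly (the input representation below is our assumption; p, the significance of r, additionally requires a statistical test and is omitted here):

```python
import math

def alignment_stats(aligned_chunks, total_tokens, unaligned_tokens):
    """dp, n, r from an alignment: aligned_chunks is a list of
    (length_a, length_b) pairs of aligned non-markup chunks."""
    dp = unaligned_tokens / total_tokens             # difference percentage
    n = sum(1 for a, b in aligned_chunks if a != b)  # unequal-length pairs
    xs = [a for a, _ in aligned_chunks]
    ys = [b for _, b in aligned_chunks]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    r = cov / (sx * sy) if sx > 0 and sy > 0 else 0.0  # Pearson correlation
    return dp, n, r
```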
In addition, we observe that on a bilingual news site, a page which is the translation of an original page is usually created within a short period after the original was published. Therefore, using this feature we can eliminate many pairs which are not parallel. For example, on bilingual news sites such as BBC and VOA, the Vietnamese pages are published on the same day as, or one day later than, the corresponding English pages [14]. To extract this information, we analyzed the HTML tags and then grouped this feature (publication date) into the structural feature set. Note that this information is extracted from different HTML formats (e.g. the META tag on the BBC site: <META name = "OriginalPublicationDate" content = "2009/11/22 12:29:48"/>, the SPAN tag on the VietnamPlus site: <SPAN id = "ctl00_mContent_lbDate" class = "timeDate">10/04/2009</SPAN>, etc.).
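Extracting the date from these two formats can be sketched with regular expressions (the patterns match exactly the two tag shapes shown above; other sites would need their own patterns):

```python
import re

def extract_pub_date(html):
    """Return the publication date string from a BBC META tag or a
    VietnamPlus SPAN tag, or None if neither format is found."""
    m = re.search(r'name\s*=\s*"OriginalPublicationDate"\s+'
                  r'content\s*=\s*"(\d{4}/\d{2}/\d{2})', html)
    if m:
        return m.group(1)
    m = re.search(r'id\s*=\s*"ctl00_mContent_lbDate"[^>]*>(\d{2}/\d{2}/\d{4})',
                  html)
    if m:
        return m.group(1)
    return None

print(extract_pub_date('<META name = "OriginalPublicationDate" '
                       'content = "2009/11/22 12:29:48"/>'))  # → 2009/11/22
```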
D. Classification Modeling
From the content-based and structural analysis of each pair of web pages, we obtain features divided into two categories: content features and structural features. The content features include sim_cognates(A, B), the text length, and the number of paragraphs. The structural features include dp, n, r, p, and the publication date. It is now easy to formulate the task as a classification problem. Each candidate pair of web pages is represented by a vector of these features. We label a pair 1 if it is parallel and 0 otherwise, and in this way we obtain the training data. In our system, we use a support vector machine algorithm to train the classification system. For a new pair of web pages, we first extract the features to represent it as a vector. This vector goes through the classification system, which returns 1 or 0, i.e. the answer to whether the pair is parallel or not.
III. EXPERIMENT
A. Data Preparation
We have explored several bilingual English-Vietnamese news sites on the World Wide Web. There are a few sites with high translation quality. In this system, we experiment with 94,323 pages downloaded from three web sites: 37,665 pages from BBC³, 12,553 pages from VietnamPlus⁴, and 44,105 pages from VOA News⁵.
Firstly, we perform host crawling on the specified domains, and the HTML pages are downloaded. All collected data is analyzed by the CB module to filter candidate pairs. We used the following thresholds:

sim_cognates(A, B) > 0.5,
publication date difference ≤ 1 (day).
As a result, we excluded over 90% of the pairs, which are not considered candidates. Consequently, we obtained 1,170 candidate pairs for determining whether each of them is parallel or not. Next, all data obtained from the content filtering module goes into the structure module to extract the designed features. We then labeled each candidate pair: a pair is labeled 1 if it is parallel, and 0 otherwise. There are 433 pairs labeled 1 and 737 pairs labeled 0 among these 1,170 candidate pairs. After that, we put this data into the format <label> <index1>:<value1> <index2>:<value2> ..., which is suitable for the LIBSVM tool⁶.
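Serializing a labeled feature vector into this sparse LIBSVM format is straightforward (the feature ordering and the helper name are ours):

```python
def to_libsvm_line(label, features):
    """Format one candidate pair as a LIBSVM training line:
    <label> <index1>:<value1> <index2>:<value2> ...
    `features` is the ordered list of feature values for the pair."""
    parts = [str(label)] + [f"{i}:{v}" for i, v in enumerate(features, start=1)]
    return " ".join(parts)

# e.g. cognate sim, length ratio, paragraph ratio, dp, n, r, date difference
print(to_libsvm_line(1, [0.72, 0.91, 1.0, 0.05, 2, 0.98, 0]))
# → 1 1:0.72 2:0.91 3:1.0 4:0.05 5:2 6:0.98 7:0
```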
B. Experimental Results
We conduct a 5-fold cross-validation experiment; each fold has 234 test items and 936 training items. To investigate the effectiveness of different kinds of features, we design three feature sets: a set containing only content-based (CB) features, a set containing only structure-based (SB) features, and a set which includes both kinds of features (i.e. CB and SB features).
We also use the three well-known evaluation measures, defined as follows:
Precision = (No. of pairs labeled 1 that are true) / (Total no. of pairs labeled 1 in the output data)    (2)

Recall = (No. of pairs labeled 1 that are true) / (Total no. of pairs labeled 1 in the test data)    (3)
³ http://news.bbc.co.uk
⁴ http://en.vietnamplus.vn, http://www.vietnamplus.vn
⁵ http://www.voanews.com
⁶ http://www.csie.ntu.edu.tw/~cjlin/libsvm/
F-Score = (2 × Precision × Recall) / (Precision + Recall)    (4)
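Applied to predicted and gold labels over the candidate pairs, formulas (2)-(4) amount to:

```python
def evaluate(predicted, gold):
    """Precision, recall and F-score for the positive (parallel) label,
    following formulas (2)-(4)."""
    tp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 1)
    predicted_pos = sum(1 for p in predicted if p == 1)  # labeled 1 in output
    actual_pos = sum(1 for g in gold if g == 1)          # labeled 1 in test data
    precision = tp / predicted_pos
    recall = tp / actual_pos
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```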
It is worth noting that, to compare our approach with previous approaches using content-based features, we also conducted an experiment like that in [15]. That study measures the similarity of content based on aligning word translations between the two texts. Here, we use a bilingual English-Vietnamese dictionary to compute a content-based similarity score. For each pair of texts (or web pages), the similarity score is defined as follows:

sim(A, B) = (Number of translation token pairs) / (Number of tokens in text A)    (5)
With this experiment, we obtained the results shown in Table I.
TABLE I
EVALUATING CONTENT-BASED MATCHING (USING THE BILINGUAL DICTIONARY TO MATCH WORD-WORD PAIRS)

          Precision  Recall  F-Score
Fold 1      0.688    0.484    0.568
Fold 2      0.647    0.478    0.550
Fold 3      0.643    0.548    0.591
Fold 4      0.601    0.569    0.584
Fold 5      0.682    0.528    0.595
Average     0.652    0.521    0.578
Table II shows our experimental results using the content-based features, Table III shows the results using the structure-based features, and Table IV contains the results obtained by combining both kinds of features.
TABLE II
EVALUATING CONTENT-BASED MATCHING (USING THE FEATURES EXTRACTED BY THE CONTENT-BASED FILTERING MODULE)

          Precision  Recall  F-Score
Fold 1      0.831    0.907    0.867
Fold 2      0.823    0.864    0.843
Fold 3      0.810    0.836    0.823
Fold 4      0.878    0.765    0.818
Fold 5      0.931    0.803    0.862
Average     0.855    0.835    0.843
TABLE III
EVALUATING STRUCTURE-BASED MATCHING

          Precision  Recall  F-Score
Fold 1      0.409    0.620    0.493
Fold 2      0.518    0.614    0.562
Fold 3      0.397    0.614    0.482
Fold 4      0.451    0.763    0.567
Fold 5      0.444    0.654    0.529
Average     0.444    0.653    0.529
TABLE IV
COMBINING STRUCTURAL AND CONTENT-BASED MATCHING

          Precision  Recall  F-Score
Fold 1      0.873    0.817    0.844
Fold 2      0.862    0.842    0.852
Fold 3      0.869    0.879    0.874
Fold 4      0.904    0.817    0.858
Fold 5      0.904    0.733    0.810
Average     0.882    0.817    0.848

It is worth noting that for this task (extracting a parallel corpus), precision is the most important criterion for evaluating the effectiveness of the system. According to the results shown in the above tables, the precision obtained with the CB feature set (85.5%) is much higher than with the SB feature set (44.4%). Our approach to extracting CB features is also much better than the approach of [15], which obtains only 65.2% precision. The combination of both kinds of features gives the best result, with a precision of 88.2%.
These results show that the CB features we proposed are very effective. They also suggest that if we are not sure about the structural correspondence between two web pages, we can use the content-based features alone.
IV. CONCLUSION

This paper has presented our work on extracting a parallel corpus from the Web for the English-Vietnamese language pair. We have proposed a new approach for measuring the content similarity of two pages which does not require deep linguistic analysis. We have utilized both structural features and content-based features under a machine learning framework. The obtained results show that the proposed content-based features provide the major information for determining whether a pair of web pages is parallel. In addition, our approach can be applied to other language pairs because the features used in the proposed model are language-independent. In the future, we will extend our work to extracting smaller parallel components such as paragraphs, sentences, or phrases. This will also be interesting in cases where the quality of translation between bilingual web pages is not good.
Acknowledgment
This work is supported by NAFOSTED (Vietnam’s National
Foundation for Science and Technology Development)
REFERENCES

[1] Akira Kumano and Hideki Hirakawa. 1994. Building an MT dictionary from parallel texts based on linguistic and statistical information. In Proc. 15th COLING, pages 76-81.
[2] Brown, P., Cocke, J., Della Pietra, S., Della Pietra, V., Jelinek, F., Mercer, R., Roossin, P. (1990). "A statistical approach to machine translation". Computational Linguistics, 16(2), 79-85.
[3] Chen, J., Nie, J.Y. 2000. Automatic construction of a parallel English-Chinese corpus for cross-language information retrieval. In Proc. ANLP, pages 21-28, Seattle.
[4] Chen, J., Chau, R. and Yeh, C.-H. (2004). Discovering Parallel Text from the World Wide Web. In Proc. Australasian Workshop on Data Mining and Web Intelligence (DMWI2004).
[5] Davis, M., Dunning, T. (1995). "A TREC evaluation of query translation methods for multi-lingual text retrieval". Fourth Text Retrieval Conference (TREC-4). NIST.
[6] Martin Volk, Spela Vintar, Paul Buitelaar. "Ontologies in Cross-Language Information Retrieval". Wissensmanagement 2003: 43-50.
[7] Melamed, I. D. (1998). "Word-to-word models of translation equivalence". IRCS Technical Report 98-08, University of Pennsylvania.
[8] Michel Simard, George F. Foster, Pierre Isabelle. "Using Cognates to Align Sentences in Bilingual Corpora".
[9] Oard, D. W. (1997). "Cross-language text retrieval research in the USA". Third DELOS Workshop, European Research Consortium for Informatics and Mathematics.
[10] P. Resnik and N. A. Smith. 2003. The Web as a Parallel Corpus. Computational Linguistics, 29(3):349-380.
[11] Resnik, Philip. 1998. Parallel strands: A preliminary investigation into mining the Web for bilingual text. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98), Langhorne, PA, October 28-31.
[12] Resnik, Philip. 1999. Mining the Web for bilingual text. In Proceedings of the 37th Annual Meeting of the ACL, pages 527-534, College Park, MD, June.
[13] Takehito Utsuro, Hiroshi Ikeda, Masaya Yamane, Yuji Matsumoto, and Makoto Nagao. 1994. Bilingual text matching using bilingual dictionary and statistics. In Proc. 15th COLING, pages 1076-1082.
[14] Van B. Dang, Ho Bao-Quoc. 2007. Automatic Construction of English-Vietnamese Parallel Corpus through Web Mining. In Proceedings of the 5th IEEE International Conference on Computer Science - Research, Innovation and Vision of the Future (RIVF'2007), Hanoi, Vietnam.
[15] Xiaoyi Ma, Mark Liberman. 1999. BITS: A method for bilingual text search over the Web. Machine Translation Summit VII, September 1999.