ORIGINALITY STATEMENT
1.1 Parallel corpus and its role
1.2 Current studies on automatically extracting parallel corpus
1.3 Objectives of the thesis
1.4 Contributions
1.5 Thesis’ structure
2 Related works
2.1 The general framework
2.2 Structure-based methods
2.3 Content-based methods
2.4 Hybrid methods
2.5 Summary
3 The proposed approach
3.1 The proposed model
3.1.1 Host crawling
3.1.2 Content-based filtering module
3.1.2.1 The method based on cognation
3.1.2.2 The method based on identifying translation segments
3.1.3 Structure analysis module
3.1.4 Classification modeling
3.2 Summary
4 Experiment
4.1 Evaluation measures
4.2 Experimental setup
4.3 Experimental results
4.4 Discussion
5 Conclusion and Future Works
5.1 Conclusion
5.2 Future works
List of Figures

1.1 An example of English-Vietnamese parallel texts
2.1 General architecture in building parallel corpus
2.2 The STRAND architecture [1]
2.3 An example of aligning two documents
2.4 The workflow of the PTMiner system [2]
2.5 The algorithm of translation pairs finder [3]
2.6 Architecture of the PTI system [4]
2.7 An example of the two links in the text
3.1 Architecture of the Parallel Text Mining system
3.2 Architecture of a standard Web crawler
3.3 An example of a candidate pair
3.4 Description of the content-based filtering module process
3.5 An example of two corresponding texts of English and Vietnamese
3.6 The algorithm measures similarity of cognates between a text pair (Etext, Vtext)
3.7 Relationships between bilingual web pages
3.8 The paragraphs can be denoted from HTML pages based on the tag <p>
3.9 Identifying translation paragraphs
3.10 A sample code written in Java to perform translation from English into Vietnamese via Google AJAX API
3.11 Web documents and the source HTML code for two parallel translated texts
3.12 An example of the publication date feature extracted from an HTML page
3.13 Classification model
4.1 Figure for precision and recall measures
4.2 The format of training and testing data
4.3 Performance of identifying translation segments method
4.4 Comparison of the methods
List of Tables

1.1 Europarl parallel corpus: 10 aligned language pairs all of which include English
3.1 Symbols and descriptions
4.1 URLs from three sites: BBC, VOA News and VietnamPlus
4.2 Number of downloaded pages and number of candidate pairs
4.3 Structure-based method
4.4 Content-based method
4.5 Method based on cognation
4.6 Combining structural features and cognate information
4.7 Identifying translation at document level
4.8 Identifying translation at paragraph level
4.9 Identifying translation at sentence level
4.10 Overall results of each method (P-Precision, R-Recall, F-FScore)
In this chapter, we first introduce the parallel corpus and its role in NLP applications. Current studies, the objectives of the thesis, and its contributions are then presented. Finally, the thesis’ structure is briefly described.
Parallel text
Different definitions of the term “parallel text” (also known as bitext) can be found in the literature. As commonly understood, a parallel text is a text in one language together with its translation in another language. Dan Tufis [5] gives a definition: “parallel text is an association between two texts in different languages that represent translations of each other”. Figure 1.1 shows an example of English-Vietnamese parallel texts.
Parallel corpus
A parallel corpus is a collection of parallel texts. According to [6], the simplest case is where only two languages are involved and one of the corpora is an exact translation of the other (e.g., the COMPARA corpus [7]). However, some parallel corpora exist in several languages; for instance, the Europarl parallel corpus [8] includes versions in 11 European languages, as reported in Table 1.1. In addition, the direction of the translation need not be constant, so some texts in a parallel corpus may have been translated from language L1 to language L2 and others the other way around. The direction of the translation may not even be known.

Figure 1.1: An example of English-Vietnamese parallel texts.
Parallel corpora exist in several formats: they can be raw parallel texts, or they can be aligned texts. The texts can be aligned at the paragraph level, the sentence level, or even the phrase and word level. The alignment of the texts is useful for different NLP tasks. Statistical machine translation [9, 10] uses parallel sentences as the input for the alignment module, which produces word translation probabilities. Cross-language information retrieval [11–13] uses parallel texts for determining corresponding information in both questioning and answering. Extracting semantically equivalent components of parallel texts, such as words, phrases, and sentences, is useful for bilingual dictionary construction [14, 15]. Parallel texts are also used for the acquisition of lexical translations [16] or word sense disambiguation [17]. For most of the mentioned tasks, parallel corpora currently play a crucial role in NLP applications.
Table 1.1: Europarl parallel corpus: 10 aligned language pairs, all of which include English.

Basically, we can classify the studies on mining parallel corpora from the Web into three groups: content-based (CB) [3, 4, 22], structure-based (SB) [1, 2, 18], and hybrid (a combination of both) [19–21].
The CB approach uses the textual content of the parallel document pairs being evaluated. It usually uses lexicon translations obtained from a bilingual dictionary to measure the similarity of the content of the two texts. When a bilingual dictionary is available, documents are translated word by word into the target language. The translated documents are then used to find the best-matching parallel documents by applying similarity score functions such as cosine, Jaccard, Dice, etc. However, using a bilingual dictionary may cause difficulty because a word usually has many possible translations.
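As a concrete illustration of the CB approach, the sketch below translates an English token list word by word with a toy lexicon (the entries are invented for illustration, not taken from any of the cited systems) and scores the result against a Vietnamese token list with cosine similarity:

```python
from collections import Counter
import math

def translate_word_by_word(tokens, lexicon):
    # Replace each source word with its first dictionary translation;
    # untranslatable words are kept as-is.
    return [lexicon.get(t, [t])[0] for t in tokens]

def cosine_similarity(tokens_a, tokens_b):
    # Cosine over bag-of-words term-frequency vectors.
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy English->Vietnamese lexicon (illustrative entries only).
lexicon = {"hello": ["xin chào"], "world": ["thế giới"]}
english = ["hello", "world"]
vietnamese = ["xin chào", "thế giới"]
print(cosine_similarity(translate_word_by_word(english, lexicon), lexicon and vietnamese))  # 1.0
```

Picking only the first translation of each word sidesteps the ambiguity problem mentioned above, at the cost of missing matches when the target text uses a different translation.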
Meanwhile, the SB approach relies on analyzing the HTML structure of pages. It uses the hypothesis that parallel web pages are presented with similar structures, so the similarity of two web pages is estimated based on their HTML structure. Note that this approach does not require linguistic knowledge. In addition, it is very effective at filtering out a large number of unmatched documents, as it is quite fast yet accurate. Nevertheless, it has the drawback of requiring that two sites with similar content be presented in the same structure. From our observation, many sites use the same template to design their pages, so the structure of the pages is similar even though their content is different. For that reason, the HTML structure-based approach is not applicable in some cases.
As we have introduced, a parallel corpus is a valuable resource for different NLP tasks. Unfortunately, the available parallel corpora are not only relatively small in size but also unbalanced, even for the major languages [3]. Some resources are available: for English-French, the data are usually restricted to government documents (e.g., the Hansard corpus) or newswire texts. Others have limited availability due to licensing restrictions [23]. According to [24], there are now some reliable parallel corpora: the Hansard Corpus1, the JRC-Acquis Parallel Corpus2, Europarl3, and COMPARA4. However, these resources exist only for some language pairs.
In Vietnam, NLP is at an early stage, and the lack of parallel corpora is even more severe. The lack of this kind of resource has been an obstacle to the development of data-driven NLP technologies. There are a few studies on mining parallel corpora from the Web; one of them is presented in [22] (for the English-Vietnamese language pair). On the other hand, the current studies [1–4, 18–21], while extremely useful, have a few drawbacks, as mentioned in Section 1.2. So, obtaining a parallel corpus of high quality is still a challenge, which is why it remains a big motivation for many studies in this area.
The objective of this research is to extract parallel texts from bilingual web sites for the English-Vietnamese language pair. We first propose two new methods of designing content-based features: (1) based on cognation, and (2) based on identifying translation segments. Then, we combine the content-based features with structural features under a machine learning framework.
1 http://www.isi.edu/natural-language/download/hansard/
2 http://langtech.jrc.it/JRC-Acquis.html
3 http://www.statmt.org/europarl/
4 http://www.linguateca.pt/COMPARA/
1.4 Contributions
In our work, we aim to automatically extract English-Vietnamese parallel texts. As encouraged by [20], we formulate this problem as a classification problem in order to utilize, as much as possible, the knowledge from structural information and the similarity of content. The most important contribution of our work is that we propose two new methods of designing content-based features and combine them with structure-based features to extract parallel texts from bilingual web sites.
• The first method is based on cognation. It is worth emphasizing that, differently from previous studies [2, 20], we use cognate information in place of word-by-word translation. From our observation, when translating a text from one language to another, some special parts are kept unchanged or changed only slightly. These parts are usually abbreviations, proper nouns, and numbers. We also use other content-based features, such as the length of tokens and the length of paragraphs, which likewise do not require any linguistic analysis. It is worth noting that with this approach we do not need any dictionary, so we think it can be applied to other language pairs.
• The second method, based on identifying translation segments, is used to match translation paragraphs. This helps us extract proper translation units from bilingual web pages. Previous studies usually use lexicon translations obtained from a bilingual dictionary to measure the similarity of the content of two texts, as in [4, 20]. That approach may cause difficulty because a word usually has many possible translations. Instead, we use the Google translator, because with it we can utilize the advantages of a statistical machine translation system: it helps to disambiguate lexical ambiguity, translate phrases, and reorder words.
Given below is a brief outline of the topics discussed in the remaining chapters of this thesis:
Chapter 2 - Related works
The studies that are closely related to our work are introduced in this chapter.
Chapter 3 - The proposed approach
We present our proposed model, including its general architecture and how the structural features and content-based features are designed and estimated.
Chapter 4 - Experiment
This chapter evaluates the effectiveness of our proposed method for extracting parallel texts from the Web. The performance of our proposed method and of the baselines is presented here.
Chapter 5 - Conclusion and Future works
Final conclusions about our work as a whole and the evaluation of the results in particular are presented, followed by suggestions for possible future work.
Finally, the references list research that is closely related to our work.
Related works
In this chapter, we outline the general framework for building a parallel corpus. Then, we review the studies that are closely related to our work.
Figure 2.1: General architecture in building parallel corpus
In general, there are two approaches to building a parallel corpus (illustrated in Figure 2.1). The first is to automatically collect bilingual documents from the Web. The process of identifying parallel texts is a simple step-by-step procedure: (1) locate bilingual web sites, (2) crawl for URLs of possible parallel web pages, and (3) match parallel pages. Content features and structural features are used to extract the parallel texts (the details of this task are presented in the next sections).
The other approach is based on monolingual corpora [25]. As seen from the diagram, starting with two large monolingual corpora (a non-parallel corpus) divided into documents, this approach is composed of three steps: (1) select pairs of similar documents; (2) from each such pair, generate all possible sentence pairs and pass them through a simple word-overlap-based filter, thus obtaining candidate sentence pairs; and (3) present the candidates to a maximum entropy (ME) classifier that decides whether the sentences in each pair are mutual translations.
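Step (2)'s word-overlap filter can be sketched as follows; the lexicon format and the 0.3 threshold are illustrative assumptions, not values from [25]:

```python
def word_overlap_filter(src_tokens, tgt_tokens, lexicon, threshold=0.3):
    # Keep a candidate sentence pair only if a sufficient fraction of the
    # source words have at least one dictionary translation appearing in
    # the target sentence.
    if not src_tokens:
        return False
    tgt = set(tgt_tokens)
    hits = sum(1 for w in src_tokens
               if any(t in tgt for t in lexicon.get(w, [])))
    return hits / len(src_tokens) >= threshold

# Toy lexicon; a real system would use a full bilingual dictionary.
lexicon = {"house": ["nhà"], "big": ["lớn", "to"]}
print(word_overlap_filter(["big", "house"], ["nhà", "to"], lexicon))  # True
```

Only the pairs surviving this cheap test are handed to the (more expensive) ME classifier.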
Figure 2.2: The STRAND architecture [1].

The original STRAND is an architecture for structural translation recognition, acquiring natural data. Its goal is to identify pairs of web pages that are mutual translations. In order to do this, it exploits an observation about the way that web page authors disseminate information in multiple languages: when presenting the same content in two different languages, authors exhibit a very strong tendency to use the same document structure. STRAND therefore locates pages that might be translations of each other, via a number of different strategies, and filters out page pairs whose page structures diverge by too much. The STRAND architecture has three basic steps (illustrated in Figure 2.2):
• Location of pages that might have parallel translations,
• Generation of candidate pairs that might be translations, and
• Structural filtering out of nontranslation candidate pairs
The heart of STRAND is a structural filtering process that relies on analysis of the pages’ underlying HTML to determine a set of pair-specific structural values, and then uses those values to decide whether the pages are translations of one another. The first step in this process is to linearize the HTML structure and ignore the actual linguistic content of the documents.
Both documents in the candidate pair are run through a markup analyzer thatacts as a transducer, producing a linear sequence containing three kinds of token:
[START:element label] e.g., [START:H3]
[END:element label] e.g., [END:H3]
[Chunk:length] e.g., [Chunk:250]
The chunk length is measured in nonwhitespace bytes, and the HTML tags are normalized for case. Attribute-value pairs within the tags are treated as non-markup text (e.g., <FONT COLOR=“BLUE”> produces [START:FONT] followed by [Chunk:12]).

The second step is to align the linearized sequences using a standard dynamic programming technique. For example, consider two documents that begin as in Figure 2.3.
Figure 2.3: An example of aligning two documents
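The linearization step can be sketched with Python's html.parser. This is a simplification of STRAND's transducer: attribute values are ignored rather than counted as chunk text, and chunk lengths count non-whitespace characters rather than bytes:

```python
from html.parser import HTMLParser

class Linearizer(HTMLParser):
    # Reduce an HTML document to STRAND-style tokens:
    # [START:TAG], [END:TAG], and [Chunk:n], where n is the number of
    # non-whitespace characters in a run of text.
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"[START:{tag.upper()}]")

    def handle_endtag(self, tag):
        self.tokens.append(f"[END:{tag.upper()}]")

    def handle_data(self, data):
        n = len("".join(data.split()))
        if n:
            self.tokens.append(f"[Chunk:{n}]")

lin = Linearizer()
lin.feed("<h3>Hello world</h3>")
print(lin.tokens)  # ['[START:H3]', '[Chunk:10]', '[END:H3]']
```

The two token sequences produced this way are what the dynamic-programming alignment operates on.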
Using this alignment, the authors compute four values from the aligned structures, which indicate the amount of non-shared material, the number of aligned non-markup text chunks of unequal length, the correlation of the lengths of the aligned non-markup chunks, and the significance level of the correlation. Machine learning, namely decision trees, is then used for filtering, based on these four values.

The PTMiner system [2] works on extracting bilingual English-Chinese documents. This system uses a search engine to locate hosts containing the parallel web pages. In order to generate candidate pairs, PTMiner uses a URL-matching process (e.g., the Chinese translation of a URL such as "http://www.XXXX.com/ /eng/ e.html" might be "http://www.XXXX.com/ /chi/ c.html") and other features such as size, date, etc. Note that the URLs do not match in most of the bilingual English-Vietnamese web sites.
Figure 2.4: The workflow of the PTMiner system [2]
The PTMiner implements the following steps (illustrated in Figure 2.4):
1. Search for candidate sites - using existing Web search engines, search for the candidate sites that may contain parallel pages.

2. Filename fetching - for each candidate site, fetch the URLs of Web pages that are indexed by the search engines.

3. Host crawling - starting from the URLs collected in the previous step, search through each candidate site separately for more URLs.

4. Pair scan - from the obtained URLs of each site, scan for possible parallel pairs.

5. Download and verifying - download the parallel pages, determine the file size, language, and character set of each page, and filter out non-parallel pairs.
In the experiments, several hundred selected pairs were evaluated manually. The results were quite promising: from a corpus of 250 MB of English-Chinese text, statistical evaluation showed that 90% of the identified pairs were correct.
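The pair-scan step can be sketched as a URL substitution search; the substitution rules below are hypothetical examples, not the actual rules used in [2]:

```python
# Hypothetical language-marking substrings that often distinguish the two
# sides of a bilingual site (illustrative rules only).
RULES = [("/eng/", "/chi/"), ("_e.", "_c."), ("-en/", "-zh/")]

def candidate_pairs(urls):
    # For every URL containing a source-language marker, check whether the
    # URL obtained by substituting the target-language marker also exists
    # in the crawled URL set; if so, the two form a candidate pair.
    url_set = set(urls)
    pairs = []
    for u in urls:
        for src, tgt in RULES:
            if src in u:
                v = u.replace(src, tgt)
                if v in url_set:
                    pairs.append((u, v))
    return pairs

urls = ["http://www.example.com/eng/news_e.html",
        "http://www.example.com/chi/news_e.html"]
print(candidate_pairs(urls))
```

As noted above, such rules rarely work for English-Vietnamese sites, whose URL pairs usually do not follow a common pattern.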
2.3 Content-based methods
The approach discussed thus far relies heavily on document structure. However, as Ma and Liberman [3] point out, not all translators create translated pages that look like the original page. Moreover, structure-based matching is applicable only in corpora that include markup, and there are certainly multilingual collections on the Web and elsewhere that contain parallel text without structural tags. All these considerations motivate an approach to matching translations that pays attention to similarity of content, whether or not similarities of structure exist. In this section, we describe three systems: Bilingual Internet Text Search (BITS) [3], Parallel Text Identification (PTI) [4], and Dang’s system [22].
The BITS system starts with a given list of domains to search for parallel text. In this system, a translation lexicon (each entry of which lists a word in language L1 and its translation in language L2) is used to find translation token pairs. For a given text A in language L1, they first tokenize A and every text B in language L2. The similarity between A and every B is measured with the algorithm in Figure 2.5. They then find the B that is most similar to A; if the similarity between A and B is greater than a given threshold t, then A and B are declared a translation pair. The similarity between A and B is defined as

sim(A, B) = (number of translation token pairs) / (number of tokens in text A)    (2.1)
In their experiments, Ma and Liberman use an English-German bilingual lexicon of 117,793 entries. The authors report 99.1% precision and 97.1% recall on a hand-picked set of 600 documents (half in each language) containing 240 translation pairs (as judged by humans).
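Formula (2.1) can be implemented directly; the toy lexicon below is an invented stand-in for the 117,793-entry one used by Ma and Liberman:

```python
def bits_similarity(tokens_a, tokens_b, lexicon):
    # sim(A, B) = (number of translation token pairs)
    #           / (number of tokens in text A), as in formula (2.1).
    # Each token of B may be paired with at most one token of A.
    remaining = list(tokens_b)
    pairs = 0
    for w in tokens_a:
        for t in lexicon.get(w, []):
            if t in remaining:
                remaining.remove(t)
                pairs += 1
                break
    return pairs / len(tokens_a) if tokens_a else 0.0

# Toy English-German lexicon (illustrative entries only).
lexicon = {"the": ["der", "die", "das"], "house": ["haus"]}
print(bits_similarity(["the", "house"], ["das", "haus"], lexicon))  # 1.0
```

A pair (A, B) would then be accepted when this score exceeds the threshold t.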
Figure 2.5: The algorithm of the translation pairs finder [3].

The PTI system (illustrated in Figure 2.6) crawls the Web to fetch parallel multilingual Web documents using a Web spider. To determine the parallelism between potential bilingual document pairs, two different modules are developed. A filename comparison module is used to check filename resemblance. A content analysis module is used to measure the degree of semantic similarity; it incorporates a novel content-based similarity scoring method for measuring the degree of parallelism of every potential document pair based on their semantic content, using a bilingual wordlist. The results showed that the PTI system achieves a precision rate of 93% and a recall rate of 96% (180 instances are correct among a total of 193 pairs extracted).
Figure 2.6: Architecture of the PTI system [4]
To our knowledge, there are few studies in this field related to Vietnamese. [22] built an English-Vietnamese parallel corpus based on content-based matching. Firstly, candidate web page pairs are found by using the features of sentence length and date. Then, they measure the similarity of content using a bilingual English-Vietnamese dictionary and decide whether two pages are parallel based on some thresholds of this measure. Note that this system only searches for parallel pages that are good translations of each other and are required to be written in the same style. Moreover, using word-by-word translation causes much ambiguity. Therefore, this approach is difficult to extend when the data grow, as well as when applied to bilingual web sites with various styles.
Another instance of this approach, instead of using a bilingual dictionary, uses a simple word-based statistical machine translation system to translate texts in one language into the other. [26] uses this method to build an English-Chinese parallel corpus from a huge text collection of Xinhua Web bilingual news corpora collected by LDC1. By adding the newly built parallel corpus to their existing corpus, they reported an increase in the translation quality of their word-based statistical machine translation in terms of word alignment. A bootstrapping approach [27] can also be applied to incrementally increase the number of both parallel sentences and bilingual lexical vocabulary.
The last version of STRAND [20] is another well-known web parallel text mining system. Its goal is to identify pairs of web pages that are mutual translations. The authors used the AltaVista search engine to search for multilingual web sites and generated candidate pairs based on manually created substitution rules. The heart of STRAND is a structural filtering process that relies on analysis of the pages’ underlying HTML to determine a set of pair-specific structural values, and then uses those values to filter the candidate pairs. This system also proposes a new method that combines content-based and structure matching by using a cross-language similarity score as an additional parameter of the structure-based method. A translation lexicon is used to link tokens between pairs of parallel documents; a link is a pair (x, y) in which x is a word in language L1 and y is a word in L2. An example of two texts with links is illustrated in Figure 2.7. Using the results of MCBM2, they defined the translational similarity measure tsim as

tsim = (number of two-word links in best matching) / (number of links in best matching)    (2.2)
1 Linguistic Data Consortium, at http://www.ldc.upenn.edu/
2 Problem of maximum cardinality bipartite matching
Figure 2.7: An example of the two links in the text.
In their experiments, approximately 400 pairs were evaluated by human annotators. STRAND produced fewer than 3500 English-Chinese pairs, with a precision of 98% and a recall of 61%.
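A sketch of the tsim computation: translation links between the two token sets are resolved with a maximum-cardinality bipartite matching (Kuhn's augmenting-path algorithm), and unmatched words on either side are counted as word-to-NULL links. This is one reading of formula (2.2), with an invented toy lexicon:

```python
def max_bipartite_matching(links):
    # links: dict mapping each L1 word to the L2 words it may link to.
    # Returns a dict (L2 word -> L1 word) forming a maximum matching.
    match = {}
    def try_assign(u, seen):
        for v in links.get(u, ()):
            if v not in seen:
                seen.add(v)
                if v not in match or try_assign(match[v], seen):
                    match[v] = u
                    return True
        return False
    for u in links:
        try_assign(u, set())
    return match

def tsim(words_l1, words_l2, lexicon):
    # tsim = (two-word links in best matching) / (all links in best matching),
    # counting each unmatched word on either side as a word-to-NULL link.
    s1, s2 = set(words_l1), set(words_l2)
    links = {w: sorted(t for t in lexicon.get(w, []) if t in s2) for w in s1}
    matched = len(max_bipartite_matching(links))
    total = matched + (len(s1) - matched) + (len(s2) - matched)
    return matched / total if total else 0.0

# Toy English-Vietnamese lexicon (illustrative entries only).
lexicon = {"water": ["nước"], "sky": ["trời"]}
print(tsim(["water", "sky"], ["nước", "trời"], lexicon))  # 1.0
```

The resulting tsim value is then fed into the structure-based classifier as an additional parameter.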
In other systems, [19] proposed a method that combines length-based and content-based methods to do parallel text matching, exploiting only the title part of a web page. They achieved 100% accuracy, but the recall is not high because in many cases the title of the corresponding text is not well translated. In [21], URL-based, length-based, content-based, and HTML structure features are incorporated within a k-nearest-neighbours classifier to do parallel text matching for English-Chinese. To identify a bilingual web site, they use the anchor and ALT text information within the HTML page; if some pages contain text that matches a list of pre-defined strings indicating English and Chinese, the page is considered a bilingual page. [28] proposed a similar approach: the author presents a system that automatically collects bilingual texts from the Internet, with criteria for parallel text detection based on size, HTML structures, and a word-by-word translation model.
In this chapter, we presented related work on mining parallel corpora from the Web. The content-based approach usually uses a bilingual dictionary to match word pairs across the two languages, while the structure-based approach relies on analyzing the HTML structure of pages. In real implementations, both approaches are usually employed to get good performance. Generally, the structure-based methods are applied to quickly filter out documents that are apparently not matched with a given document, after which the content-based methods are applied to find the right translational document pairs.
The proposed approach
In this chapter, we introduce our proposed model, including its general architecture and how the structural features and content-based features are designed and estimated. We also present the classification modeling used in our system.
In this work, our proposed approach combines content-based features and structure-based features of the HTML pages to extract parallel texts from the Web using machine learning [20]. The machine learning algorithm used here is the Support Vector Machine (SVM). Figure 3.1 illustrates the general architecture of our proposed model. As shown in the model, it includes the following tasks:
• Firstly, we use a crawler on the specified domains to extract bilingual English-Vietnamese pages, which are called the raw data.

• Secondly, from the raw data, we create candidate pairs of parallel web pages by using thresholds on extracted features (content-based features and the date feature).
• Thirdly, we manually label these candidates to obtain training data. That is, we obtain some pairs of web pages that are parallel and assigned label 1, and some other pairs that are not parallel and assigned label 0 (the details of this task are presented in the experiment section).
Figure 3.1: Architecture of the Parallel Text Mining system.
• Fourthly, we extract structural features and content-based features so that each web page pair can be represented as a vector of these features. This representation is required to fit a classification model.
• Finally, we use an SVM tool to train a classification system on this training data. Then, given a test pair of English-Vietnamese web pages, the obtained classifier decides whether it is parallel or not.
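The last two steps can be sketched as follows. The feature names are hypothetical stand-ins for the features described later in this chapter, and a tiny perceptron stands in for the SVM tool (the pipeline shape — vectors in, binary label out — is the same):

```python
def extract_features(pair):
    # Hypothetical feature vector for one candidate page pair; each entry
    # is one of the content-based or structural scores of this chapter.
    return [pair["cognate_sim"], pair["length_ratio"], pair["structure_sim"]]

class Perceptron:
    # Minimal linear classifier standing in for the SVM.
    def __init__(self, n_features):
        self.w = [0.0] * n_features
        self.b = 0.0

    def predict(self, x):
        return 1 if sum(wi * xi for wi, xi in zip(self.w, x)) + self.b > 0 else 0

    def fit(self, X, y, epochs=20, lr=0.1):
        for _ in range(epochs):
            for x, t in zip(X, y):
                err = t - self.predict(x)
                if err:
                    self.w = [wi + lr * err * xi for wi, xi in zip(self.w, x)]
                    self.b += lr * err

# Label 1 = parallel pair, label 0 = non-parallel pair (toy values).
pairs = [{"cognate_sim": 0.9, "length_ratio": 1.0, "structure_sim": 0.8},
         {"cognate_sim": 0.1, "length_ratio": 0.3, "structure_sim": 0.2}]
X = [extract_features(p) for p in pairs]
clf = Perceptron(3)
clf.fit(X, [1, 0])
print(clf.predict(X[0]), clf.predict(X[1]))  # 1 0
```

In the actual system an SVM with its trained decision function takes the place of this toy classifier.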
3.1.1 Host crawling
According to [29], Web crawling is the process of locating, fetching, and storing pages on the Web. The computer programs that perform this task are referred to as Web crawlers or spiders. In general terms, the working of a Web crawler is illustrated in Figure 3.2. A typical Web crawler, starting from a set of seed pages, locates new pages by parsing the downloaded pages and extracting the hyperlinks (in short, links) within. Extracted links are stored in a FIFO fetch queue for further retrieval. Crawling continues until the fetch queue becomes empty or a satisfactory number of pages have been downloaded.

Figure 3.2: Architecture of a standard Web crawler.
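The FIFO crawl loop can be sketched as a breadth-first traversal; here a link-extraction function stands in for the real HTTP fetch and HTML parse:

```python
from collections import deque

def crawl(seeds, get_links, limit=100):
    # Breadth-first crawl: pop a URL from the FIFO fetch queue, "download"
    # it via get_links (a function URL -> list of out-links, standing in
    # for a real fetch), and enqueue unseen links until the queue is empty
    # or the page limit is reached.
    queue = deque(seeds)
    seen = set(seeds)
    fetched = []
    while queue and len(fetched) < limit:
        url = queue.popleft()
        fetched.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched

# Toy link graph in place of the live Web.
graph = {"a": ["b", "c"], "b": ["c"], "c": []}
print(crawl(["a"], lambda u: graph.get(u, [])))  # ['a', 'b', 'c']
```

A production crawler would add politeness delays, robots.txt handling, and error recovery on top of this skeleton.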
In our work, bilingual English-Vietnamese web pages are collected by crawling the Web using a Web spider, as in [4]. To execute this process, our system uses Teleport-Pro1 to retrieve web pages from remote web sites. Teleport-Pro is a tool designed to download documents from the Web via the HTTP and FTP protocols and store the extracted data on disk [3]. Note that we select the URLs on the specified hosts from three news sites: BBC, VietnamPlus, and VOA News. For example, the URL on the BBC site for English is "http://www.bbc.co.uk", and "http://www.bbc.co.uk/vietnamese/" for Vietnamese. We then use Teleport-Pro to download the HTML pages, obtaining the candidate web pages.
3.1.2 Content-based filtering module
The HTML pages are converted to plain text after they are retrieved from the remote web sites. Note that the original web pages usually contain non-textual user interface components such as JavaScript, Flash, etc. So, we use a simple script to clean them and extract only the text for content-based matching.
1 http://www.tenmax.com/teleport/pro/home.htm
Figure 3.3: An example of a candidate pair.
As commonly understood, with content-based features we want to determine whether two pages are mutual translations. However, as [3] pointed out, not all translators create translated pages that look like the original page. Moreover, structure-based matching is applicable only in corpora that include markup, and there are certainly multilingual collections on the Web and elsewhere that contain parallel text without structural tags [20]. Many studies have used this approach to build a parallel corpus from the Web, such as [4, 22]. They use a bilingual dictionary to measure the similarity of the contents of two texts. However, this method can cause much ambiguity because a word usually has many translations. For English-Vietnamese, one word in English can correspond to multiple words in Vietnamese. To overcome this limitation, we propose two new methods of designing features: (1) based on cognation, and (2) based on identifying translation segments.
Figure 3.4: Description of the content-based filtering module process.
3.1.2.1 The method based on cognation
This method uses cognate information, which provides a cheap and reasonable resource. The proposal is based on the observation that a document usually contains some cognates, and that if two documents are mutual translations then the cognates are usually kept the same in both of them. Cognates are words that are spelled similarly in two languages, or words that are simply not translated (e.g., abbreviations). For example, if the word “WTO” appears in an English text, it probably also appears as “WTO” in the Vietnamese text. Note that [30] also uses cognates, but for sentence alignment. We divide the tokens that are considered cognates into three types as follows:

1. The abbreviations (e.g., “EU”, “WTO”),
2. The proper nouns in English (e.g., “Vietnam”, “Paris”), and
3. The numbers (e.g., “10”, “2010”).
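The three token types can be recognized with simple patterns; the regular expressions below are heuristic assumptions (e.g., treating any capitalized word as a proper noun), not the exact rules of our system:

```python
import re

def cognate_type(token):
    # Classify a token into one of the three cognate types:
    # "number", "abbreviation" (all-caps, length >= 2), or
    # "proper" (capitalized word); returns None for ordinary tokens.
    if re.fullmatch(r"\d+([.,]\d+)?", token):
        return "number"
    if re.fullmatch(r"[A-Z]{2,}", token):
        return "abbreviation"
    if re.fullmatch(r"[A-Z][a-z]+", token):
        return "proper"
    return None

print([cognate_type(t) for t in ["WTO", "Vietnam", "2010", "the"]])
# ['abbreviation', 'proper', 'number', None]
```

In practice, distinguishing true proper nouns from sentence-initial capitalized words would require extra care.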
Now, we can design a feature to measure the similarity of content based on cognates. This feature is estimated as the ratio between the number of corresponding cognates in the two texts and the number of tokens in one text (e.g., the English text). Given a pair of texts (Etext, Vtext), where Etext stands for the English text and Vtext for the Vietnamese text, we obtain the cognate token sets T1 and T2 from Etext and Vtext, respectively. For a robust matching between cognates, we make some modifications to the original tokens:
• A number written as a sequence of letters in the English alphabet is converted into a real number. According to our observations, the units of numbers in English are often retained when translated into Vietnamese, so we do not consider the case where the units differ (e.g., inch vs. cm, pound vs. kg, USD vs. VND, etc.).
• We use a list that contains the corresponding names between English and Vietnamese, including names of countries, continents, dates, etc. However, the names of countries in English can be translated into Vietnamese in different ways. Therefore, we only consider those English names whose corresponding Vietnamese names have been published on the Wikipedia site2.
Figure 3.5 is an example of two corresponding English and Vietnamese texts. Scanning these texts, we obtain T1 = {“Vietnam”, “Italy”, “1998”, “60”, “40”, ...} and T2 = {“1998”, “Vietnam”, “Italy”, “60”, “40”, ...}. We measure the similarity of cognates between Etext and Vtext using the algorithm presented in Figure 3.6.
If sim_cognates(Etext, Vtext) is greater than a threshold, then the pair (Etext, Vtext) is a candidate. The value of sim_cognates(Etext, Vtext) is calculated as in formula (3.1):

sim_cognates(Etext, Vtext) = count / (number of tokens in T1)    (3.1)

where:

• count: the number of tokens matching between T1 and T2 (as in the algorithm in Figure 3.6)
2 http://vi.wikipedia.org/wiki/Danh sach quoc gia
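Formula (3.1) and the matching step of Figure 3.6 can be sketched as follows (a simplification: exact token matching is assumed, after the normalizations described above have been applied):

```python
def sim_cognates(t1, t2):
    # Formula (3.1): the number of cognate tokens of T1 matched in T2,
    # divided by the number of tokens in T1. Each token of T2 may be
    # matched at most once.
    remaining = list(t2)
    count = 0
    for tok in t1:
        if tok in remaining:
            remaining.remove(tok)
            count += 1
    return count / len(t1) if t1 else 0.0

t1 = ["Vietnam", "Italy", "1998", "60", "40"]
t2 = ["1998", "Vietnam", "Italy", "60", "40"]
print(sim_cognates(t1, t2))  # 1.0
```

For the example texts of Figure 3.5, every cognate of T1 reappears in T2, so the score is 1.0; pairs scoring above the threshold become candidates.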