
Keio University Master's Thesis, Academic Year 2009

Keio University – Graduate School of Media and Governance

Vo Ho Bao Khanh

Japanese – Vietnamese compound word machine translation

Abstract of Master's Thesis, Academic Year 2009

Summary

Compound words are highly frequent and productive in many languages, especially in Japanese. Translating Japanese compound words into other languages is therefore indispensable in multilingual processing applications, such as machine translation and multilingual information retrieval, that require exact translations of compounds. However, compound word translation is a difficult task because of variable compound structures, multiple relations among the constituents of a compound, and several possible translations for each constituent. Machine translation of Japanese compound words into widely used languages has addressed these difficulties thanks to adequate multilingual data and excellent morphological processing tools for those languages, but comparable work for less popular languages has not yet been done. In this thesis, we aim to translate Japanese compound words automatically into a less popular language, in the hope of proposing a general translation framework for such languages. We chose Vietnamese as the target language because demand for understanding Japanese has grown considerably in Vietnam since the two countries established full-fledged relations.

Our proposed approach consists of two phases: generation and selection. In the first phase, we apply a morphologically-based compositional translation method that uses the inflectional features, grammatical features, and semantic links of Japanese constituents to generate translation candidates despite the sparseness of the dictionary. We then select the most likely candidates by evaluating the term frequencies of the generated Vietnamese candidates with the help of available search engines. This selection method is appropriate for less popular languages, which have limited language processing tools and few good-quality corpora. We implemented our approach and evaluated it on various data sets. The results show that the approach is effective and adaptable not only for Vietnamese but also for other languages. It may also be applicable to other research areas such as multiword expression compilation and multilingual information retrieval.

Keywords:

1. Machine translation, 2. Compound word, 3. Japanese, 4. Vietnamese, 5. Less popular language


修士論文要旨2009年度(平成 21 年度)

論文要旨

複合語は多くの言語において頻度高く現れ、新たに作られやすい。日本語では特にそのような特徴がある。従って、日本語の複合語を他の言語に精度良く翻訳することは、機械翻訳だけでなく多言語情報検索のような多言語処理の応用分野でも必要不可欠である。しかし、複合語は多様であり、その構成語間には様々な関係がある。また構成語に複数の翻訳が可能である場合には複合語の翻訳は高度になる。マイナーな言語への機械翻訳では、その言語のコーパスを利用し、メジャーな言語のための優れた形態素処理ツールを活用して、上記の難しさを解決すべく研究が進められてきている。本論文では、日本語の複合語をマイナーな言語に機械翻訳するための一般的な方法の枠組みを提案することを目的とする。さらに、日本語ベトナム語間の機械翻訳を研究することにより、ベトナムと日本の相互理解を促進し、良好な関係を育てることを目指す。

提案したアプローチは、翻訳における訳語の生成と選択の二つの段階から成り立っている。第1の段階では、日本語の形態素に基づいて構成的な手法による翻訳を行う。そこでは、語尾変化の特徴、文法的特徴や意味的リンクを用いて、用意した辞書に語が十分に豊富でなくても、訳語候補を生成することができる。第2段階では、検索エンジンを用いて、生成したベトナム語候補の単語出現頻度を計算することにより、最も可能性の高い訳語候補を選ぶ。この選択法は言語処理ツールと良質のコーパスが十分でないマイナーな言語を扱うのに適している。本研究は、提案した手法を実行し、様々なデータセットを用いて評価した。その結果から、提案したアプローチがベトナム語のみならず他の言語に対しても有効に適用可能であることを示している。また、慣用句のような多単語表現の処理や多言語情報検索などの研究へも将来は応用可能と考える。


ボ ホ バオ カイン


Acknowledgements

First and foremost, I wish to express my most heartfelt gratitude and appreciation to Prof. Shun Ishizaki and Dr. Kiyoko Uchiyama for their intellectual and tireless guidance, invaluable advice, and indispensable cooperation throughout the process of completing this study. Their unlimited enthusiasm and encouragement have been the major forces leading me to natural language processing research in particular and computer science research in general. I am deeply indebted to Prof. Kuniaki Mukai and Prof. Hiroaki Saito, who gave me insightful comments and valuable suggestions that significantly improved this thesis. I would like to thank Prof. Hiroshi Nakagawa of Tokyo University for the interesting discussions and constructive criticism that generated several helpful points for my research. Many thanks go to the committee members of SFC, Keio University, who gave valuable opinions about my thesis.

I would like to send my deepest thanks to all members of Ishizaki Laboratory, including visiting researchers Dr. Jun Okamoto and Dr. Miyoko Nakamura; alumni Kyota Tsutsumida, Shunsuke Aihara, and Shiro Asano; graduate students Yoshino Kayoko, Takehiro Teraoka, Nao Tatsumi, Chie Nakamura, and Hiroi Kubota; and many other graduate and undergraduate students, for helping me and sharing with me throughout my school life. The members of Ishizaki Laboratory made me feel at home.

I am grateful to Assoc. Prof. Dinh Dien and the members of the VCL research group for their kind advice. Many thanks go to all members of the HESPI scholarship program and the JICE staff, who supported me in various ways. I also thank my Vietnamese friends at Keio University and elsewhere in Japan, who have played an important role in my life since I started studying in Japan.

I would like to take this special occasion to thank my parents and my younger sisters for their unconditional love and support. Special thanks go to my soul mate, Nguyen Dinh Tu, who always inspired and took care of me while I was working on the thesis. I could not have finished this study without the kindness of my nearest and dearest.


Vo Ho Bao Khanh – ボ ホ バオ カイン

List of Tables

Table 2.1: JCN composition
Table 2.2: JCV classes
Table 2.3: Vietnamese language tones
Table 2.4: Ratio of Sino-Vietnamese equivalents (two-kanji words) in JLPT tests
Table 3.1: Semantic relations and paraphrase
Table 4.1: XML schema of the Japanese – Vietnamese dictionary
Table 4.2: XML schema of search engine results
Table 4.3: XML schema of search engine result analysis data
Table 4.4: XML schema of final Vietnamese results
Table 4.5: Relations of JCN constituents
Table 4.6: Translation templates of compound nouns with many constituents
Table 5.1: Datasets
Table 6.1: Two-word constituent evaluation
Table 6.2: More than two word constituent translation
Table 6.3: Two-word constituent evaluation between SEs
Table 6.4: More than two word constituent translation between SEs
Table 6.5: JCV translation accuracy


List of Figures

Figure 3.1: General architecture of MWE/CW translation
Figure 3.2: Non-existing dependency relations within JCWs
Figure 3.3: Semantic factors for motion verbs
Figure 4.1: General function hierarchy diagram
Figure 4.2: Main processing model
Figure 4.3: Search engine processing model
Figure 4.4: List of keywords and their hits
Figure 4.5: Search engine result combination
Figure 5.1: A constituent translation example


Contents

Acknowledgements
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Motivation and problem statements
1.2 Outline
1.3 Contribution
1.4 Structure of the thesis
Chapter 2 Background
2.1 Multiword expression and compound word
2.1.1 Multiword expression
2.1.2 Compound words
2.2 Japanese compound word
2.2.1 Japanese compound noun
2.2.2 Japanese compound verb
2.3 Japanese compound word translation
2.4 Vietnamese language and related problems
2.5 Japanese – Vietnamese language researches
2.6 Summary
Chapter 3 Related works
3.1 Compositional translation method
3.1.1 Lexically-based compositional method
3.1.2 Morphologically-based compositional method
3.2 Semantic relation analysis of JCW constituents
3.3 Selection of generated translation candidates
3.4 Summary
Chapter 4 Models and methods
4.1 Analysis and design translation system
4.1.1 General analysis and design
4.1.2 Data structure
4.2 Translation template compilation
4.2.1 Grammatical and semantic feature analysis of JCN
4.2.2 Translation templates proposition
4.2.3 Translation template verification
4.2.4 Translation rules for JCV translation
4.3 Selection method
4.3.1 The highest result-based selection
4.3.2 Search engine confidence-based selection
4.4 Summary
Chapter 5 Implementation and experiment
5.1 Data collection
5.1.1 Data extraction from Japanese – Vietnamese dictionary (dataset 1)
5.1.2 Data extraction from other resource
5.1.3 Manually inputted data
5.2 Compound noun translation
5.2.1 Compositional translation
5.2.2 Selection experiments
5.3 Compound verb translation
5.3.1 Compositional translation
5.3.2 Selection
5.4 Summary
Chapter 6 Evaluation
6.1 Evaluation method and criteria
6.2 Evaluation of compound noun translation
6.2.1 Evaluation of the highest result selection
6.2.2 Evaluation with the selection between SEs
6.3 Evaluation of compound verb translation
6.4 Summary
Chapter 7 Conclusion and future work
7.1 Conclusion
7.2 Future work
Bibliography
List of Publications


Chapter 1

Introduction

In this chapter, we first present the general motivations and problem statements that lead to translating Japanese compound words into Vietnamese. We then list the contributions of this study and close the chapter with the structure of the thesis.

1.1 Motivation and problem statements

Along with the development of the Internet, the opportunities and challenges of interacting with multilingual information also evolve. Some of these challenges are being addressed by the field of human language technology, including natural language processing and computational linguistics. In particular, the problem that information is often available in one language but not in another is being tackled by machine translation and cross-lingual information retrieval. Available machine translation systems and multilingual search engines such as Systran, Yahoo! Babel Fish, Bing and Google are becoming more and more important in the virtual world. They can translate and retrieve information quite well for popular languages. However, perfect translation is still a long way off, especially for less popular languages, because of the varying, productive nature of natural language. Machine translation therefore remains an active research topic, and its sub-problem, compound word translation, is no exception. In fact, the translation of compound words, which are mainly compound nouns, is a major issue in machine translation because of their frequency of occurrence and high productivity [19].

Since compound words are particularly common and productive in Japanese, Japanese compound word translation is an indispensable task in translating Japanese into other languages. Several works have translated Japanese compound nouns into popular languages to serve other purposes, such as multilingual information retrieval [6] or multiword expression compilation [11]. These works took advantage of available resources such as bilingual corpora and morphological processing tools. Nevertheless, there is no representative work on translating Japanese into less popular languages with fewer available resources, despite the increasing need for communication. Thus, in this research, we translate Japanese compound words into a language of this kind. We chose Vietnamese as a representative of less popular languages because:

• Japanese has become very popular in Vietnam, since Japan is one of the biggest investors there. Successful Japanese – Vietnamese machine translation systems can therefore bring gains not only in communication but also in economics and trade.

• On the other hand, only a little research on Japanese – Vietnamese machine translation has been done so far. The translation of adnominal modification structures [12] is the most prominent work. This leads us to translate Japanese compound words into Vietnamese so as to enrich computational linguistic research on both languages.

• From a linguistic point of view, the structural similarity between Japanese words consisting of two kanji and their Sino-Vietnamese two-word counterparts, which is about 60% [5], motivates us to translate compound words automatically in order to examine related linguistic problems.

• We hope that these linguistic results will reveal basic principles of Vietnamese compound word structure and the influence of Chinese words on both Japanese and Vietnamese.

To meet these objectives, we need to understand the current status of Japanese compound word translation and the available language resources. The full investigation is presented in Chapter 2; here we summarize the problems as follows:

• How can usable compound word data be found and prepared in both languages when no bilingual data is available?

• How can Japanese compound words be translated into Vietnamese?

• How can the translation results be evaluated?

1.2 Outline

The purpose of this study is to translate Japanese compound words into Vietnamese. Because a compound word is built from multiple constituents, the compositional translation method is often used to translate compound words [19][11]. Unfortunately, since one constituent may have numerous translations in another language, it is challenging to select the most relevant one. For example, the English translation of the Japanese constituent 計画 keikaku differs across compound words: 配布・計画 haifu-keikaku “distribution schedule” vs. 経済・計画 keizai-keikaku “economic plan/programme” vs. 主要・計画 shuyou-keikaku “major project”. Furthermore, compound word translation faces other problems relating to the context of translation and to constructional variability, that is, differences in the part-of-speech (POS) patterns of the translations in the two languages. The context of translation often decides which of the translation candidates is correct. The following Japanese – English examples of constructional variability indicate the challenges in compound word translation: 機械・学習 kikai-gakushuu “machine learning” (N-N) vs. 経済・発展 keizai-hatten “economic development” (Adj-N) vs. 関係・改善 kankei-kaizen “improvement in relations” (N in N).
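The combinatorial side of compositional translation can be illustrated with a minimal sketch. The dictionary below is hypothetical: the glosses for 計画 are taken from the examples above, while the entries for 経済 and the function name `generate_candidates` are assumptions made only for illustration, not the thesis implementation.

```python
from itertools import product

# Hypothetical constituent dictionary. The glosses for 計画 follow the examples
# in the text; the entries for 経済 are illustrative assumptions.
CONSTITUENT_DICT = {
    "経済": ["economic", "economy"],
    "計画": ["schedule", "plan", "programme", "project"],
}

def generate_candidates(constituents, dictionary):
    """Return every combination of constituent translations, in source order."""
    options = [dictionary[c] for c in constituents]
    return [" ".join(combo) for combo in product(*options)]

candidates = generate_candidates(["経済", "計画"], CONSTITUENT_DICT)
print(len(candidates))  # number of candidates = product of the option counts
print(candidates)
```

Even this two-constituent toy case yields several candidates, most of them unnatural, which is why a separate selection step is needed.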

Many researchers have used parallel or comparable corpora to pick the correct translation among the candidates [19]. A monolingual target-language corpus can also be used to calculate corpus evidence for both the fully-specified translation and its translated constituents [4]. Japanese compound words, however, mostly appear in technical documents whose translations are not always available in other languages, especially less popular languages like Vietnamese. Morphologically processed Vietnamese monolingual corpora are not available either. Given this situation, we need other resources for selecting correct translations. We adopted a method that finds Vietnamese technical terms on the Web, since many documents are added to the Internet every day. Consequently, we decided to use search engines to look up these technical terms, because competition among the available search engines has greatly improved their search quality.
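A minimal sketch of frequency-based selection is given below, under the assumption that each candidate is submitted as an exact-phrase query and the candidate with the most hits wins. No real search engine is called: `fake_hit_count` and the numbers in `FAKE_HITS` are invented placeholders, and `select_best` is a hypothetical name.

```python
from typing import Callable, Dict, List

def select_best(candidates: List[str], hit_count: Callable[[str], int]) -> str:
    """Pick the candidate whose query returns the most hits (first one on ties)."""
    return max(candidates, key=hit_count)

# Invented hit counts standing in for a real search-engine query.
FAKE_HITS: Dict[str, int] = {
    "kế hoạch kinh tế": 1200,
    "chương trình kinh tế": 800,
    "dự án kinh tế": 300,
}

def fake_hit_count(phrase: str) -> int:
    """Placeholder for a search-engine lookup; unknown phrases get zero hits."""
    return FAKE_HITS.get(phrase, 0)

best = select_best(list(FAKE_HITS), fake_hit_count)
print(best)
```

Passing the hit-count function as a parameter keeps the selection logic independent of any particular search engine, so several engines could be compared with the same code.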

1.3 Contribution

By answering the above research questions, we pursue the following objectives and contributions:

• Creating usable compound word data in both languages.

• Proposing translation methods not only for Japanese – Vietnamese compound words but also for translating Japanese compound words into other less popular languages.

• Expanding the stock of Vietnamese technical terms and specialized expressions.

• Contributing to the development of other research areas such as sentence translation, multilingual information retrieval, and multiword dictionary compilation.

• Enriching linguistic research on Japanese, Vietnamese and other Oriental languages.

• Helping, in part, to narrow the language gap between Japan and Vietnam.

1.4 Structure of the thesis

To give an overall view of this thesis, we introduce its structure. It is organized into seven chapters, including this Introduction. The rest of the thesis comprises the following main parts:

• Chapter 2: Background: This chapter covers linguistic aspects of Japanese and Vietnamese compound words, then introduces Japanese – Vietnamese language processing research and resources. Finally, we discuss methods of translating Japanese compound words.

• Chapter 3: Related works: This chapter presents different methods for analyzing and translating Japanese compound words which directly relate to […]

Chapter 2

Background

This chapter presents background knowledge on the linguistic aspects of Japanese and Vietnamese compound words, then describes methods of translating Japanese compound words, and finally introduces Japanese – Vietnamese language processing research to date.

2.1 Multiword expression and compound word

Compound words and multiword expressions are closely related in their definitions. Multiword expressions include most compound words; we therefore define multiword expressions before turning to compound words.

2.1.1 Multiword expression

According to Sag, Baldwin, et al. [17], multiword expressions (MWEs) are lexical items that: (a) can be decomposed into multiple lexemes; and (b) display lexical, syntactic, semantic, pragmatic and/or statistical idiomaticity.

Because the decomposition of an MWE into multiple lexemes differs from language to language, deciding whether an expression is an MWE also varies across language types. In English, MWEs are often regarded as “words with spaces” [2]. In other words, MWEs must be made up of multiple whitespace-delimited words. Online marketing, for example, is an MWE, as it is made up of the two lexemes online and marketing. In Japanese, determining MWEs is more complicated because Japanese is a non-segmenting language. MWE determination depends largely on the convention for what counts as a standalone lexeme. For instance, 機械・翻訳 kikai-honyaku “machine translation” is an MWE because both kikai “machine” and honyaku “translation” are standalone lexemes, but 部長 buchou “department head” is not an MWE because bu “department” is a standalone lexeme but chou “head” is not.

The second requirement for an MWE is its idiomatic property. In the context of MWEs, idiomaticity refers to distinctiveness or deviation from the basic properties of the component lexemes, and applies at the lexical, syntactic, semantic, pragmatic, and/or statistical levels. A given MWE is often idiomatic at multiple levels (e.g. syntactic, semantic and statistical in the case of the expression by and large) [3].

MWEs can be broadly classified into lexicalized phrases and institutionalized phrases [2]. Lexicalized phrases have at least partially idiosyncratic syntax or semantics, or contain 'words' which do not occur in isolation; they can be further broken down into fixed expressions, semi-fixed expressions and syntactically-flexible expressions, in roughly decreasing order of lexical rigidity. Institutionalized phrases are syntactically and semantically compositional, but occur with markedly high frequency (in a given context).

There are three main types of MWEs: nominal MWEs, verbal MWEs, and prepositional MWEs. Nominal MWEs are one of the most common MWE types in terms of token frequency, type frequency, and their occurrence in the world's languages [19][3]. The primary type of nominal MWE in English is the noun compound (NC), where two or more nouns combine to form a noun, such as computer science department; the rightmost noun in the NC is termed the head noun [3]. Verbal MWEs have several constructions: verb-particle constructions, made up of a verb and an obligatory particle (e.g. cut short, take off, let go); prepositional verbs, made up of a verb and a selected preposition, where the preposition is transitive (e.g. refer to, look for); light-verb constructions, made up of a verb and a noun complement, often in the indefinite singular form (e.g. do a demo, give a kiss, have a drink, make an offer, take a bath); and verb-noun idiomatic combinations, composed of a verb and a noun in direct object position, which are at least semantically idiomatic (e.g. kick the bucket, shoot the breeze). Prepositional MWEs consist of two kinds: determinerless prepositional phrases, made up of a preposition and a singular noun without a determiner (e.g. by bus, in school); and complex prepositions, which can take the form of fixed MWEs (e.g. in addition to) or semi-fixed MWEs, for example optionally allowing internal modification (e.g. with (due/particular/special/...) regard to) or determiner insertion (e.g. on (the) top of).

From the MWE point of view, a compound word can generally be considered another name for, or an interpretation of, an MWE [5], and compound words cover some types of MWEs. In other words, a compound word can refer to an MWE whose component words, or constituents, are written together to assume a syntactic function. A compound word is therefore also idiomatic at multiple levels. For example, joined words like sleepwalk and darkroom are conventionally classified not as MWEs but as compound words.

According to the above classification of MWEs, compound words are decomposable idioms among syntactically-flexible expressions, and institutionalized phrases. Typically, nominal MWEs, a type of institutionalized phrase, are compound nouns, where two or more nouns combine to form a noun and the rightmost noun is termed the head noun. Compound nouns are the most common compound words in almost all languages. Whether verbal MWEs and prepositional MWEs include compound verbs and prepositional compound words depends on the structure of each language. In addition, verb-noun compounds, noun-verb compounds, and adjective-noun compounds exist in some languages. Because of this language dependence, “compound word” is a more common term than “MWE” for expressing the composition of single words in some languages. Moreover, in computational linguistics it is difficult to handle the wide spectrum of MWE problems, so research primarily concentrates on narrower problems such as compound words. We decided to use the term “compound word” for processing Japanese because research concentrates on solving compound word problems (see Section 2.2) rather than the many language processing problems of MWEs [2], and because compound words are more frequent and productive in Japanese than other types of MWEs.

2.2 Japanese compound word

Japanese compound words (JCWs) are highly frequent and productive in Japanese documents, especially in specialized documents, because of their expressive power and because new specialized terms emerge almost every day [19][8]. Indeed, new words can be created frequently and easily by combining single words or lexemes. In contrast, it is not easy to analyze these compound words into constituents [8]. For example, the compound word 元/日銀/総裁 moto/nichigin/sousai has two possible dependency structures: ((元) ((日銀)(総裁))) “the former president of the Bank of Japan” and (((元) (日銀)) (総裁)) “the president of the former Bank of Japan”, of which only the first is correct.

In line with the above definition of a compound word, a Japanese compound word is also a sequence of constituent words assuming a syntactic function. For instance, the compound word 情報・検索 jouhou-kensaku “information retrieval”, which consists of the two words jouhou “information” and kensaku “retrieval”, is idiomatic and functions as a noun. Although the number of constituents may vary from two to an indefinite number, most JCWs have two to five constituents, and two is the most common. It is also reported that a large number of JCWs are compound nouns, alongside compound verbs [25].
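The structural ambiguity of 元/日銀/総裁 can be made concrete with a small sketch that enumerates every binary bracketing of a constituent sequence. This only illustrates how the search space grows; it is not the analysis method of any cited work, and the function name `bracketings` is an assumption.

```python
def bracketings(words):
    """Return all binary bracketings of a word sequence as nested tuples."""
    if len(words) == 1:
        return [words[0]]
    results = []
    # Try every split point and combine the bracketings of both halves.
    for split in range(1, len(words)):
        for left in bracketings(words[:split]):
            for right in bracketings(words[split:]):
                results.append((left, right))
    return results

structures = bracketings(["元", "日銀", "総裁"])
for s in structures:
    print(s)
```

For three constituents the function returns exactly the two structures discussed above; the count follows the Catalan numbers, so longer compounds quickly become much more ambiguous.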

2.2.1 Japanese compound noun

Japanese compound nouns (JCNs) consisting of two constituents are the main subject of much research on JCN grammatical analysis, semantic analysis and translation because of their high frequency among JCNs. According to the grammatical analysis of Uchiyama [20], JCNs are composed of Chinese characters (kanji) and native Japanese words, but most JCNs consist of Chinese characters. The constituents carry the following POS information:

• Kanji: Noun, Verbal noun (in Japanese: サ変・動詞形・名詞 sahen-doushikei-meishi), Adjective noun

• Native Japanese word: Noun, Conjunctive noun, Adjective noun stem

Given these POS categories, the composition of JCNs is as follows:

Table 2.1: JCN composition

Composition                               Example
Noun + Verbal noun                        特許・検索 tokkyo-kensaku “patent search”
Noun + Noun                               特許・文書 tokkyo-bunsho “patent documents”
Adjective noun stem + Noun                特殊・文字 tokushu-moji “special character”
Conjunctive noun + Noun                   読み取り・装置 yomitori-souchi “reading device”
Adjective noun stem + Conjunctive noun    斜め・読み naname-yomi “oblique reading”

Based on this grammatical analysis of two-constituent JCNs, Uchiyama et al. [24] identified semantic relations in JCNs for extracting and paraphrasing JCNs in patent document analysis. These relations are discussed in more detail in Chapter 3.

Nevertheless, as far as we could find, no comparable linguistic analysis has been done for JCNs consisting of more than two words. For automatic processing, these JCNs are generally treated as nouns, on the assumption that the rightmost constituent is the headword and every other word depends on a constituent to its right. Based on this assumption and on the mutual information of constituents, Dongli Han et al. [7] automatically analyzed the structures of JCNs with more than two constituents quite well and outperformed prior analysis research.

2.2.2 Japanese compound verb

The compound verb in Japanese (JCV) is a concatenation of two or more verbs which functions as a single multiword verb. There are several classes of JCVs, as shown in the following table:

Table 2.2: JCV classes

Class               Example
Verb + Verb         食べ・始める tabe-hajimeru “to start to eat”
Verb + Adjective    蒸し・暑い mushi-atsui “steaming hot”
Verbal noun-suru    休憩・する kyukei-suru “take a rest”

Among these classes, “Verbal noun-suru” compounds are often available in dictionaries. In the most common and largest class, “Verb + Verb”, the first verb is in the continuative form (also known as the masu-stem because it forms the base for the polite spoken -masu group of inflections), and the constituents are native Japanese verbs, not loanwords or Chinese words [22].

From Table 2.2, we can see that JCVs play an important role in Japanese, where they are analogous to several different structures in other languages. English equivalents include compound verbs (e.g. to start to eat), verb-plus-gerund constructions (e.g. to try eating) and verb-particle constructions (e.g. to jump out, to pull down). The compositional structure “Verb + Verb” is the most common structure, but it is difficult to translate because the same constituent has various translations in different JCVs and because the compound may have an idiosyncratic meaning.

It is therefore difficult to translate JCVs into English automatically, because of the variety of idiosyncratic English translations. For that reason, JCV research in computational linguistics often stops at grammatical and semantic analysis, the proposal of translation rules, and the extraction of JCVs.

2.3 Japanese compound word translation

Japanese compound word translation is primarily the translation of JCNs, because there is no good corpus for translating JCVs and idiosyncratic translations in other languages are very common. Until now, only the work of Uchiyama et al. [21] has proposed translation rules, without experiments, for translating motion JCVs into English.

As for compound noun translation for bilingual applications such as bilingual dictionary construction, machine translation, and cross-lingual information retrieval, most works have concentrated on two-constituent JCNs (also known as NN compounds). Research on Japanese – English compound noun translation often meets the following difficulties: constructional variability in the English translation, lexical idiosyncrasies in both languages, and non-compositional NN compounds [19]. Approaches to translating Japanese compound nouns into English have to handle these difficulties. In addition, the resources available to a research group and its needs usually decide which approach is used. Some methods depend on such factors, for example: rewriting Japanese compounds into equivalent expressions that machine translation systems can handle [8]; or translating source base words into target words and then reducing sparseness by evaluating the precision of candidate words with general resources such as a target-language collocation dictionary or Web data [19][18]. Another bilingual study, on Japanese – Chinese compound nouns, used English as a pivot [25] because it could explore Web data and use English translations to link words of the source language with their correspondences in the target language; it then collected new compounds by using the composition patterns of already known words, and finally acquired their translations thanks to the availability of candidates on the Web or in Japanese – English and English – Chinese phrase translation corpora. This approach generally cannot be applied to Vietnamese, because new Vietnamese words rarely appear alongside English translations. In short, these studies need at least a bilingual dictionary and another resource, which can be a parallel corpus, a comparable corpus, a monolingual corpus, or Web data; and they mainly process NN compounds rather than compound words in general.
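The pivot idea used in the Japanese – Chinese work can be sketched as a join of two bilingual dictionaries through the shared English side. The dictionaries `JA_EN` and `EN_ZH` below are tiny hypothetical examples and `pivot_translate` is an assumed name; the cited work relies on Web data and phrase translation corpora rather than hand-written tables.

```python
def pivot_translate(word, src_to_pivot, pivot_to_tgt):
    """Collect target translations reachable through any pivot translation."""
    targets = []
    for pivot in src_to_pivot.get(word, []):
        for tgt in pivot_to_tgt.get(pivot, []):
            if tgt not in targets:  # keep first-seen order, drop duplicates
                targets.append(tgt)
    return targets

# Hypothetical toy dictionaries, for illustration only.
JA_EN = {"情報": ["information"], "検索": ["retrieval", "search"]}
EN_ZH = {"information": ["信息"], "retrieval": ["检索"], "search": ["搜索", "检索"]}

print(pivot_translate("検索", JA_EN, EN_ZH))
```

The sketch also shows the weakness noted above: if the pivot-to-target dictionary has no entry for a pivot word, as is often the case for new Vietnamese terms, the lookup simply returns nothing.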

Among these approaches, the method most often used to translate compound nouns consists of two basic stages: generation and selection [19]. In the generation phase, the compositional translation method is used to turn compound nouns in the source language into translation candidates in the target language. This works because a large share of compound noun translations are compositional: Baldwin and Tanaka [4] reported 48.7% for English – Japanese NN compound translations, and Robitaille et al. [16] reported 75.1% for their initially collected French – Japanese MWEs. The most likely candidate can then be selected by scoring the corpus-based translation quality of each candidate in a monolingual corpus and cross-lingual data [4], or by applying the direct context-vector approach to a comparable corpus [11]. All these selection methods perform well thanks to available curated corpora and fundamental language processing tools such as morphological analyzers and POS taggers.

2.4 Vietnamese language and related problems

Vietnamese is primarily spoken in Vietnam by more than 80 million people. It is a member of the Mon-Khmer branch of the Austro-Asiatic language family. Other Mon-Khmer languages include Mon, which is spoken in Myanmar; Khmer, which is spoken in Cambodia; and Muong, which is also spoken in Vietnam. Before the 20th century, Vietnamese was written in a character-based system similar to Chinese for administrative purposes, poetry and literature, but this system was not widely used among the Vietnamese people. In 1910, a romanized script that had been devised by Catholic missionaries in the 17th century was adopted as the official Vietnamese alphabet. This national writing system is still in use today.

The Vietnamese alphabet consists of 17 consonants and 12 vowels. Vietnamese is a tonal language, meaning that the tone or pitch used when a word is pronounced helps determine its meaning. There are six distinct tones in Vietnamese: the level tone (1), the high rising tone (2), the low falling tone (3), the low rising tone (4), the high rising broken tone (5), and the low broken tone (6), as in the following table:

Table 2.3: Vietnamese language tones

Vietnamese grammar tends to use word order and sentence structure to convey grammatical meanings, rather than inflections within words. This is different from Japanese, a highly inflected language. Many grammatical concepts that would be expressed by word changes in English and other languages are expressed with particles in Vietnamese. These particles are generally short words that cannot readily be translated into English. They fulfill a variety of functions in Vietnamese, from indicating tense to increasing the politeness of a sentence. Reduplication (repeating a word or part of a word) is another noteworthy element of Vietnamese grammar. Vietnamese syntax follows “Subject + Verb + Object” word order, which is different from Japanese syntax.

Lexically, because the Chinese ruled Vietnam for hundreds of years, about 60% of the Vietnamese vocabulary consists of Chinese words [1]. More recently, Vietnamese has also borrowed words from French and English. As for the relation to Japanese, Matsuda et al. [10] claimed that among 4000 Sino-Vietnamese compound words and 8000 Japanese words, 50% of all words are homogeneous or similar. The similarity rate between Kanji and Sino-Vietnamese is over 60%, and the rate may be higher for academic words and lexicons. Therefore, like Chinese and Japanese, which need morphological processing tools such as word segmentation and POS tagging in natural language processing, Vietnamese also needs morphological processing tools. In spite of the increasing need for these tools, computational linguistic research tools have not been developed satisfactorily in Vietnam. Available morphological tools give fairly good results in word segmentation but not in POS tagging, because Vietnamese POS definitions have not yet been unified.

The definition of a Vietnamese compound word is different from the Japanese compound word definition. According to Vietnamese linguistics, many Vietnamese nouns and verbs that are made up of individual segments, each of one or more syllables, working together are regarded as compound words. These individual segments may have no meaning, or an entirely different meaning, when taken out of context. Sometimes two words with the same meaning are combined for emphasis. In Vietnamese, the flow and rhyme of the language are very important. All of the elements in a sentence work together to make a language which is often said to resemble poetry. There are two main kinds of Vietnamese compound words: compound words composed by semantic coordination (e.g. quần/trousers, áo/shirt – quần áo/clothes) and compound words composed by semantic subordination (e.g. xe/vehicle, đạp/pedal – xe đạp/bicycle). These compound words are usually available in dictionaries. Considering these facts, when translating Japanese compound words to Vietnamese we would like to use a new concept of compound word that is similar to the Japanese one. A Vietnamese compound word in this case is also a sequence of constituent words (nouns or verbs composed of two or more syllables) assuming a syntactic function.

Given the influence of Chinese on both languages mentioned above, especially at the word level, it seems reasonable to process Japanese – Vietnamese bilingual compound words directly rather than using a Western language such as English or French as the pivot language.

2.5 Japanese – Vietnamese language researches

Only a few linguistic studies relating to both languages have been done so far. Matsuda et al. [10] presented the similarity of Japanese and Vietnamese words in terms of Chinese influence on both languages, using statistics on the Sino-Vietnamese translations of two-kanji words used at different Japanese Language Proficiency Test (JLPT) levels, as shown in Table 2.4.


Table 2.4: Ratio of Sino-Vietnamese equivalents (two-kanji words) in JLPT tests

is no work for solving the translation problems of compound words. We hope our compound word translation study will contribute effectively to Japanese – Vietnamese language processing research.

2.6 Summary

In this chapter, we have given background information on Japanese and Vietnamese compound words and a general introduction to Japanese compound word translation. We also indicated the method frequently used for MWE and compound word translation. We decided to apply this method to our translation problem because the other methods mentioned above are not applicable to a less-resourced language like Vietnamese. This method and its related works are explained in the next chapter.


Chapter 3

Related works

Among the compound word translation methods discussed in Chapter 2, we decided to choose the method frequently used for MWE and compound word (CW) translation. This method consists of two phases: generation and selection (see Figure 3.1: General architecture of MWE/CW translation). In the generation phase, the compositional translation method is used to translate compound nouns in a source language into translation candidates in a target language with the help of a bilingual dictionary. This method examines the grammatical structures of compound word constituents in both languages to make translation templates. In some cases, semantic relations of compound words can also be used to make these templates. In the selection phase, the most likely candidate is then selected among these candidates by one of several selection methods. The details of these phases and the JCW grammatical and semantic structure analyses are explained in this chapter.


Figure 3.1: General architecture of MWE/CW translation

3.1 Compositional translation method

In order to compile dictionaries of MWEs that cannot be translated directly, as well as to translate compound words, the compositional property of MWEs and compound words is often utilized [11]. Evidence for this property is that compositional translations accounted for 43.1%, 48.7% and 75.1% of the collected Japanese-to-English NN compounds, English-to-Japanese NN compounds, and French – Japanese MWEs, respectively [19][16]. Compositionality is defined as the property that “the meaning of the whole is a function of the meaning of the parts” [9]. In the translation process, compositionality can be considered as translating a whole sequence by translating each part individually and then appropriately piecing together the translated parts. In the most recent research on MWE translation, Morin and Daille [11] classify the compositional translation method into two types: the lexically-based compositional method and the morphologically-based compositional method.


3.1.1 Lexically-based compositional method

The lexically-based compositional translation method can be implemented through three main steps:

1. Translate each constituent of a compound word by looking it up in a dictionary. The lexical form of constituents is examined without checking the POS. The number of generated translations is $O(p^n)$, where $p$ is the number of translations of each content word of the source compound word and $n$ is the number of content words.

2. Generate all possible combinations of the translated constituents regardless of word order.

3. From the set of translation candidates, select the most likely translations according to term frequency.

The number of generated translations can be reduced or increased using compound word POS patterns in the source and target languages, which are called translation templates. These translation templates are created by examining sample translation data. For example, Tanaka and Baldwin [19] defined the following templates to filter translation candidates: the Japanese N1N2 compound word structure is translated to the English structures N1N2 (33.2% of the cases), Adj1N2 (28.4%), and N2 of (the) N1 (4.4%). However, a compound word is not translated if not all of its constituents can be translated, or if none of the translated combinations is identified in the selection phase.
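The first two steps of the lexically-based method can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the toy dictionary, its entries and the function name `generate_candidates` are hypothetical, constituents are given in romanized form, and no translation template is applied.

```python
from itertools import permutations, product

# Hypothetical toy dictionary: each source constituent maps to its
# candidate translations. The entries are illustrative only.
TOY_DICT = {
    "kanren": ["related", "relevant"],
    "kiji": ["article", "item"],
}


def generate_candidates(constituents, dictionary):
    """Return all candidate translations, or [] if a constituent is unknown."""
    options = []
    for word in constituents:
        translations = dictionary.get(word)
        if not translations:
            # A constituent has no translation, so the compound is skipped.
            return []
        options.append(translations)

    candidates = set()
    # Step 1: pick one translation per constituent
    # (p^n choices for p translations per word and n words).
    for choice in product(*options):
        # Step 2: combine the translated constituents regardless of order.
        for ordering in permutations(choice):
            candidates.add(" ".join(ordering))
    return sorted(candidates)


candidates = generate_candidates(["kanren", "kiji"], TOY_DICT)
print(len(candidates))
print(candidates)
```

With two translations for each of two constituents, the sketch produces four lexical choices and two word orders for each, which shows how quickly the candidate set grows and why templates and a selection phase are needed.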

3.1.2 Morphologically-based compositional method

The lexically-based compositional translation method is restrictive when dictionary data is insufficient to translate some compound words. For translating MWEs from French to Japanese, Robitaille et al. [16] therefore proposed a backing-off method: if an MWE of length n cannot be translated directly, shorter sub-sequences (scaled MWEs of length less than or equal to n) are used instead. This approach translates each subpart, a kind of shorter MWE element, directly whenever its translation is available in the dictionary. For instance, the French term syndrome de fatigue chronique “chronic fatigue disorder” yields the following four combinations: [syndrome de fatigue chronique], [syndrome de fatigue] [chronique], [syndrome] [fatigue chronique] and [syndrome] [fatigue] [chronique].
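The four combinations in this example are exactly the ways of cutting the three content words into contiguous segments. The following sketch enumerates them; it is an illustration only, the function name `segmentations` is hypothetical, and the function word "de" is left out for simplicity.

```python
from itertools import combinations


def segmentations(words):
    """Yield every way to cut a word sequence into contiguous segments."""
    n = len(words)
    for k in range(n):
        # Choose k cut positions among the n - 1 gaps between words.
        for cuts in combinations(range(1, n), k):
            bounds = [0, *cuts, n]
            yield [words[a:b] for a, b in zip(bounds, bounds[1:])]


# Content words of the example term; the function word "de" is omitted.
content_words = ["syndrome", "fatigue", "chronique"]
all_segmentations = list(segmentations(content_words))
for seg in all_segmentations:
    print(seg)
```

In a backing-off translator, each segment would then be looked up in the dictionary, and only segmentations whose segments all have translations would contribute candidates.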

Morin and Daille [11] proposed the morphologically-based compositional method, which is a generalization of the backing-off method proposed by Robitaille et al. [16]. In this method, instead of skipping a word when it does not appear in the dictionary, they try to link it to a word in the dictionary using morphological knowledge, which exploits the productivity of the derivation process in French in particular and in Romance languages in general. They assumed that derivational morphology is a compositional process that should be part of the translation process.

Morin and Daille [11] also noticed that when the derivation process keeps its compositional meaning, the two forms, the original form and its derivation, are semantically linked. An example is the English affix -er, whose semantic interpretation is “one who does x”, as in “baker/to bake”. When the derived word has taken on an idiosyncratic meaning from the input word, they assumed that the neutral form also retains its compositional meaning through the translation process. An example is the pair norme “standard” / normal “usual”: normal “usual” has lost its compositional interpretation in French, but could be translated by the “standard” stem meaning in another language. In their case of translating French MWEs to Japanese, they transformed the derived nominal or adjectival into a neutral form using stripping-recoding rules. Stripping-recoding rules label an MWT pattern and output candidate MWEs of another pattern by removing the affix, normalizing the stem to undo phonological change, and generating a neutral form. The generated form is a lemma that may be a verb (retrieval/retrieve), a noun (spotless/spot), or an adjective (gravity/grave), depending on the MWE patterns.
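The idea of a stripping-recoding rule can be illustrated with the three English word pairs above. The sketch below is a simplification under stated assumptions: the rule table and the names `RULES`, `neutral_forms` and `lookup_with_morphology` are hypothetical, each rule only strips a suffix and appends a recoding string, and the rules of Morin and Daille [11] themselves (written for French, with pattern labels) are not reproduced.

```python
# Hypothetical stripping-recoding rules: (suffix to strip, string to append).
RULES = [
    ("al", "e"),    # retrieval -> retrieve
    ("less", ""),   # spotless  -> spot
    ("ity", "e"),   # gravity   -> grave
]


def neutral_forms(word, rules=RULES):
    """Return the candidate neutral forms produced by every matching rule."""
    forms = []
    for suffix, recoding in rules:
        if word.endswith(suffix) and len(word) > len(suffix):
            forms.append(word[: -len(suffix)] + recoding)
    return forms


def lookup_with_morphology(word, dictionary):
    """Look up a word; if it is absent, back off to its neutral forms."""
    if word in dictionary:
        return dictionary[word]
    translations = []
    for form in neutral_forms(word):
        translations.extend(dictionary.get(form, []))
    return translations


for derived in ("retrieval", "spotless", "gravity"):
    print(derived, neutral_forms(derived))
```

The second function shows where such rules fit into translation: a word missing from the dictionary is not skipped but linked to a related entry.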

Considering these compositional methods for Japanese – Vietnamese compound word translation, we conclude that the morphologically-based compositional method is applicable because Japanese is an agglutinative language. In other words, if the translation of a JCW constituent is not available in the dictionary, we can make use of the translations of its derivation or original form in the dictionary. For example, if the translations of a verbal noun are not available, or are fewer than the translations of “verbal noun + suru”, we may take all the translations of both the verbal noun and “verbal noun + suru” to make sure that we do not miss other important translations. However, this raises questions such as: How do we identify a constituent as a verbal noun? What happens if we translate a verbal noun or an adjectival noun? Does a relation between a verbal noun and another constituent exist? To answer these questions, we need to classify the grammatical features and semantic links of JCW constituents, for instance classifying a constituent as a verbal noun and identifying its relation with the remaining constituents of the JCW in the previous example. In other words, the way this morphologically-based compositional translation method exploits the semantic link between a derivation and its original form prompted us to study the grammatical property of each constituent and the semantic relations among the constituents of JCWs, and to apply them to compositional translation. Semantic relations of JCW constituents in this case can be inferred from the grammatical and semantic features of the constituents. With respect to the lexically-based compositional translation method, these grammatical and semantic features can be used for making translation templates: if the classification of these features is comprehensive, it may work better than sample translation data when that data is inadequate. We thus decided to use these features and sample data to compile translation templates, and to use the morphologically-based translation method to increase the number of translations.

We would like to name this new method the extended morphologically-based compositional translation method. To sum up, it differs from the morphologically-based translation method in that it examines the semantic relations among constituents, and not only the semantic relation between the derivation and the original form of each constituent in a compound word.
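The verbal-noun example above amounts to merging the dictionary entries of a constituent and of its “+ suru” form. A minimal sketch follows; the function name, the romanized keys and the two dictionary entries are hypothetical and serve only to illustrate the merge, not the system described in this thesis.

```python
def constituent_translations(noun, dictionary):
    """Merge translations of a verbal noun and of its 'noun + suru' form."""
    merged = []
    for key in (noun, noun + " suru"):
        for translation in dictionary.get(key, []):
            if translation not in merged:  # keep order, drop duplicates
                merged.append(translation)
    return merged


# Hypothetical toy entries for the verbal noun kaiseki "analysis".
toy_dict = {
    "kaiseki": ["sự phân tích"],
    "kaiseki suru": ["phân tích"],
}
print(constituent_translations("kaiseki", toy_dict))
```

If the bare verbal noun has no entry at all, the same call still returns the translations of the “+ suru” form, which is the backing-off behaviour the paragraph describes.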


To support these conclusions, we will present our experiments with these methods in Chapter 4 and evaluate them in Chapter 6. However, to clarify the semantic relations of JCW constituents, we first present related research on JCW analysis.

3.2 Semantic relation analysis of JCW constituents

According to Dongli Han et al. [7], each word in a compound word, except the rightmost one (the headword), relates to, or depends semantically on, one of the words to its right more strongly than on any other word. They gave the example that 関西 kansai “Kansai” relates more strongly to 空港 kuukou “airport” than to 国際 kokusai “international” in the compound word 関西・国際・空港 “Kansai International Airport”. They concluded that the semantic dependency relations among the constituents of a Japanese compound word have the following characteristics:

• The dependency relation that holds between two constituents is unique, i.e., no constituent relates to more than one constituent.

• A constituent, except the headword, always depends on a constituent to its right.

• No dependency relations cross each other. This means the following dependency relation is not allowed:

Figure 3.2: Non-existing dependency relations within JCWs
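The three characteristics above can be checked mechanically. In the sketch below, a compound is represented by a list `heads` in which `heads[i]` is the index of the constituent that constituent `i` depends on; this representation and the function name `is_valid_dependency` are assumptions made for illustration, not part of the cited work.

```python
def is_valid_dependency(heads):
    """Check the three constraints on a compound's dependency structure.

    heads[i] is the index of the constituent that constituent i depends on;
    the headword (the last constituent) has head None. Storing one head per
    constituent already enforces the uniqueness constraint.
    """
    n = len(heads)
    if n == 0 or heads[-1] is not None:
        return False
    arcs = []
    for i, h in enumerate(heads[:-1]):
        # Each non-head constituent must depend on a constituent to its right.
        if h is None or not (i < h < n):
            return False
        arcs.append((i, h))
    # No two dependency arcs may cross each other.
    for a, b in arcs:
        for c, d in arcs:
            if a < c < b < d:
                return False
    return True


# kansai -> kuukou, kokusai -> kuukou, kuukou is the headword.
print(is_valid_dependency([2, 2, None]))
# Crossing arcs 0 -> 2 and 1 -> 3 are rejected.
print(is_valid_dependency([2, 3, 3, None]))
```

The first call corresponds to the Kansai International Airport example; the second shows a four-constituent structure with crossing relations.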

Similarly, Uchiyama et al. [23] identified semantic relations in Japanese compound nouns consisting of two constituents for patent document analysis. Information such as grammatical or semantic features plays a key role in analyzing semantic relations in compound nouns. In particular, differences in the grammatical features of nouns, or nominal grammatical categories, were used to analyze compound nouns. On closer inspection of their behavior in sentences, words commonly called nouns or nominal categories can be further classified as:

• Words that can or cannot become a subject or a direct object in a sentence.

• Words that are regarded as verbs when they occur with the verb suru “do”.

• Words that can be used only in compound words or linked with an adjective-forming suffix such as -teki or -na.

• Words that have the same form as nouns and adjectives (adjectival nouns).

Based on the observation that compound nouns can be paraphrased with case particles, Uchiyama et al., building on their previous study [24], recognized a few sub-types of distribution with respect to grammatical links and/or case particles. These sub-types help to identify the semantic interpretation of compound nouns:

• The nominative case particle -ga and/or the accusative case particle -o normally indicates that the accompanying noun is the subject and/or the direct object of the sentence.

• A verbal noun with the verb -suru “do” is regarded as a verb. Many verbal nouns come from Sino-Japanese (Chinese character) compounds. These verbal nouns can be marked with case particles as nouns.

• The suffix -teki can be compounded with a noun or an adjectival noun to modify a verb or to form a compound noun.

• The adjective-forming suffix -na is added to an adjectival noun when it modifies the following noun.

As our proposed method analyzes grammatical features to identify the semantic relations between constituents, these grammatical feature analyses are discussed in more detail in Chapter 4.

From the categories of grammatical features and links, semantic relations of constituents within JCNs can be defined and paraphrased in Concept Dictionary Language (CDL). Some examples of paraphrases using case particles:

Table 3.1: Semantic relations and paraphrase

| Semantic relation | Paraphrase | Example |
|---|---|---|
| obj (affected thing) | A o B suru, A-Acc B (Acc: accusative) | “morphological analysis” |
| pur (purpose or objective) | A no-tame-no B, B for A | 単語・辞書 tango-jisho “word dictionary” |
| mod (modification) | A teki-na B, A adjective-forming suffix B | 意味・関係 imi-kankei “semantic relation” |
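The paraphrase patterns of Table 3.1 can be read as string templates over the two constituents A and B. The sketch below simply fills those templates; the dictionary name `PARAPHRASES`, the function name `paraphrase` and the use of romanized constituents are assumptions made for illustration.

```python
# Paraphrase patterns from Table 3.1, keyed by the relation label.
PARAPHRASES = {
    "obj": "{a} o {b} suru",       # A is the affected thing of B
    "pur": "{a} no-tame-no {b}",   # B for A
    "mod": "{a} teki-na {b}",      # A modifies B
}


def paraphrase(relation, a, b):
    """Fill the paraphrase template of a semantic relation with A and B."""
    return PARAPHRASES[relation].format(a=a, b=b)


print(paraphrase("pur", "tango", "jisho"))  # tango no-tame-no jisho
print(paraphrase("mod", "imi", "kankei"))   # imi teki-na kankei
```

Checking which of these paraphrases is attested for a given compound is one way such templates can help to decide the relation between its constituents.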

When studying the translation of compound verbs, Uchiyama et al. [21] also discovered semantic factors of motion verbs. The meaning of a motion verb can be expressed by the conflation of motion with semantic factors. A motion verb that conflates motion with a semantic factor is regarded as a semantic factor verb, and there are four kinds of semantic factor verbs (see Figure 3.3: Semantic factors for motion verbs). For example, a motion verb having the ‘Directional path’ factor is defined as a ‘Directional path’ verb. The combination of these motion verbs creates the most popular JCV type, the “verb-verb” type (mentioned in Section 2.2.2), such as touri-nukeru “go past” in Ground Path and Ground Path, hairi-komu “get into”; uturi-yuku “change”, sugi-yuku “go past” in Ground Path and Directional Path; etc.


Figure 3.3: Semantic factors for motion verbs

3.3 Selection of generated translation candidates

As mentioned in Section 3.1.1, in the selection phase the most likely translations are selected among the generated translation candidates according to term frequency in the target language, a cross-lingual corpus, or Web data. In the earliest work, the most likely translations were easily selected from a parallel corpus, a resource that is not widely available in reality. More realistically, the most likely translations were selected from a comparable corpus with the help of contextual information [18]. Co-occurrence information is collected as context wherever the target compound words appear in the corpora, because it has been observed that a compound word and its translation tend to appear in the same lexical contexts [11]. The relationship between a compound word and its frequently co-occurring words in each language can be represented by a vector, in which each element represents a word that occurs within a window around the compound word to be translated (for instance, a seven-word window approximates syntactic dependencies). The translation is obtained by comparing the source context vector with each translation candidate vector after each element of the source vector has been translated with a general dictionary. This method is appropriate for languages for which sufficient comparable bilingual corpora and good morphological processing tools exist. Comparable corpora improve the quality of context vectors, and good morphological processing tools lead to good word segmentation of sentences, which is needed to define context vectors.
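The context-vector comparison described above can be sketched as follows. This is a simplified illustration under several assumptions: text is already segmented into tokens with each compound word as a single token, raw co-occurrence counts are used as vector weights, and cosine similarity is used to compare vectors (the cited works may use other association and similarity measures). All function names are hypothetical.

```python
import math
from collections import Counter


def context_vector(tokens, target, window=3):
    """Count words within `window` tokens on each side of every target hit.

    window=3 gives a seven-word window centred on the target.
    """
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            vec.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return vec


def translate_vector(vec, dictionary):
    """Map each source context word to its dictionary translations."""
    out = Counter()
    for word, count in vec.items():
        for translation in dictionary.get(word, []):
            out[translation] += count
    return out


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_candidate(source_vec, candidate_vecs, dictionary):
    """Return the candidate whose context is closest to the translated source context."""
    translated = translate_vector(source_vec, dictionary)
    return max(candidate_vecs, key=lambda c: cosine(translated, candidate_vecs[c]))
```

The sketch makes the resource requirements visible: `context_vector` needs segmented text in both languages, and `translate_vector` needs a general bilingual dictionary that covers the context words.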

On the other hand, Baldwin and Tanaka [4] proposed two selection methods: the first is based on monolingual corpus data, and the second combines monolingual corpus data with cross-lingual data derived from bilingual dictionaries. Each method takes the list of generated translation candidates and scores each one, returning the highest-scoring translation candidate as the final translation. The first method uses corpus-based translation quality (CTQ), which rates a given translation candidate according to corpus evidence for both the fully specified translation and its parts in the context of the translation template in question. The CTQ for a noun-noun compound is calculated as:

$$\mathrm{CTQ}(w_1^E, w_2^E, t) = \alpha\, p(w_1^E, w_2^E, t) + \beta\, p(w_1^E, t)\, p(w_2^E, t)\, p(t)$$

where $w_1^E$ and $w_2^E$ are the word-level translations of the source language words $w_1^J$ and $w_2^J$, respectively, and $t$ is the translation template. Each probability is calculated according to a maximum likelihood estimate based on relative corpus occurrence. The formulation of CTQ is based on linear interpolation over $\alpha$ and $\beta$, where $0 \le \alpha, \beta \le 1$ and $\alpha + \beta = 1$. In evaluation, this method showed a weakness: it treats all translations as equally likely, whereas in fact there is considerable variability in their applicability. One example of this is the simplex 記事 kiji, which is translated as either “article” or “item” (in the sense of a newspaper), of which the former is clearly the more general translation. Lacking knowledge of this conditional probability, the method considers the two translations to be equally probable, giving rise to the preferred translation of “related item” for 関連・記事 kanren-kiji “related article” due to the markedly greater corpus occurrence of related item over related article.

To address this aspect of selection, Baldwin and Tanaka [4] proposed the second selection method mentioned above. This method attempts to preserve the ability of CTQ to model target language expressional preferences, while incorporating more direct translation preferences at various levels of lexical specification. For ease of feature expandability, and to avoid interpolation over excessively many terms, the backbone of the method is the TinySVM support vector machine (SVM) learner. SVMs produce a binary classification by returning a continuous value and determining whether it is closest to +1 (the positive class) or -1 (the negative class). This value is treated as a translation quality rating, and the translation candidates are ranked accordingly. The best-scoring exemplar is selected as the best translation candidate. This selection method makes use of three basic feature types in generating a feature vector for each source language–translation candidate pair: corpus-based features, bilingual dictionary-based features and template-based features. All of these methods need proper monolingual and bilingual corpora, which must be morphologically processed for Asian languages like Japanese, Chinese, and Vietnamese. The SVM-based method in particular also needs a great deal of data. At the time of this study, we were short of such corpora for Vietnamese and for Japanese-Vietnamese. Where proper corpora are limited, Web data have recently been used instead. The Web data can hence be treated as a corpus for the previously mentioned selection methods that use context information, CTQ, or SVM. In addition, using Web statistics and search engines to verify terminology is a growing trend in Web use [14]. Breen [26] used the Google search engine to check whether automatically created abbreviations of four-kanji compounds exist. He gave several reasons for using Japanese pages in the WWW as a corpus, including:

• The Web is a very large collection of text, with over 300 million pages indexed by common search engines (as of 2004);

• It is freely available and is amenable to effective searches using search engines. Comparably large corpora, such as newspaper archives, were not available without a prior commercial arrangement;

• Prior studies have indicated a high level of correlation between the WWW and large corpora in such areas as word frequencies [14].

The abbreviation examination process started with searching for a target abbreviation through the Google API (Application Program Interface) to get its frequency of occurrence. If the frequency of the target word was not zero, the text of each retrieved snippet was scanned to re-examine whether the target word in it appears as an adjacent pair of characters or not.

Among the investigated selection methods, the method using data from the Web suits both our purpose of examining whether generated translation candidates really exist and our limited access to good morphological tools and corpora. We would like to use search engines to check the availability of generated candidates, not only for the reasons that Breen provided but also because of the powerful search refinement capabilities of search engines. Since Google introduced a very good search engine, other search engines have had to improve their search quality. This led us to the decision to combine the search capabilities of the best current search engines, Google, Yahoo! and Bing (formerly Live Search), which we believe is better than using only one search engine.
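The selection idea can be sketched as ranking candidates by their hit counts over several engines. Everything concrete in the sketch is an assumption: the engines are represented by plain callables that return a hit count for a query (no real search API is called), the counts are combined by a simple sum, and the numbers in the usage example are made up.

```python
def select_by_hits(candidates, hit_counters):
    """Rank candidates by their summed hit counts over several engines.

    hit_counters: callables mapping an exact-phrase query to a hit count.
    They stand in for real search-engine clients, which are not shown here.
    Candidates with no hits at all are treated as non-existent and dropped.
    """
    scored = []
    for cand in candidates:
        query = f'"{cand}"'  # quote the phrase to request an exact match
        total = sum(counter(query) for counter in hit_counters)
        if total > 0:
            scored.append((total, cand))
    # Highest total first; ties are broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [cand for _, cand in scored]


# Made-up hit counts for two hypothetical engines.
engine_a = {'"xe đạp"': 120, '"đạp xe"': 30}
engine_b = {'"xe đạp"': 80}
counters = [lambda q: engine_a.get(q, 0), lambda q: engine_b.get(q, 0)]

ranking = select_by_hits(["đạp xe", "xe đạp", "đạp đạp"], counters)
print(ranking)
```

Passing the engines in as callables keeps the ranking logic independent of any particular search service, so an engine can be added or removed without changing the selection code.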

3.4 Summary

From the above analyses, the approach to be used for this research problem can be determined. It consists of two phases: generation and selection. In the generation phase, we would like to experiment with both the morphologically-based translation method and our proposed method, which utilizes the semantic relations of JCW constituents. In the selection phase, we decided to exploit search engines to get the frequencies of generated translation candidates and compare them to select the most likely translation candidate.


Chapter 4

Models and methods

Chapter 3 described related works and general methods that can be used in this study. In this chapter, we present the models and methods used in this research more specifically.

4.1 Translation system analysis and design

4.1.1 General analysis and design

According to the analyses in Chapter 3, the solution for this study mainly consists of two parts: generation and selection. In the generation part, the JCWs to be translated are processed by the extended morphologically-based compositional method to generate translation candidates. In the selection part, these candidates are filtered with the help of search engines to select the most likely translation candidates. The general architecture of the translation system is shown in Figure 4.1.

Date posted: 22/06/2016, 15:49
