
Enhancing the quality of Machine Translation System Using Cross-Lingual

Word Embedding Models

Nguyen Minh Thuan

Faculty of Information Technology, University of Engineering and Technology, Vietnam National University, Hanoi

Supervised by Associate Professor Nguyen Phuong Thai

A thesis submitted in fulfillment of the requirements for the degree of Master of Science in Computer Science

November 2018


ORIGINALITY STATEMENT

‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at University of Engineering and Technology (UET/Coltech) or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UET/Coltech or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.’

Hanoi, November 15th, 2018

Signed


ABSTRACT

In recent years, Machine Translation has shown promising results and received much interest from researchers. Two approaches that have been widely used for machine translation are Phrase-based Statistical Machine Translation (PBSMT) and Neural Machine Translation (NMT). During translation, both approaches rely heavily on large amounts of bilingual corpora, which require much effort and financial support. The lack of bilingual data leads to a poor phrase-table, which is one of the main components of PBSMT, and to the unknown word problem in NMT. In contrast, monolingual data are available for most languages. Thanks to this advantage, many models of word embedding and cross-lingual word embedding have appeared to improve the quality of various tasks in natural language processing. The purpose of this thesis is to propose two models that use cross-lingual word embedding models to address the above impediment. The first model enhances the quality of the phrase-table in SMT, and the remaining model tackles the unknown word problem in NMT.

Publications:

• Minh-Thuan Nguyen, Van-Tan Bui, Huy-Hien Vu, Phuong-Thai Nguyen and Chi-Mai Luong. Enhancing the quality of Phrase-table in Statistical Machine Translation for Less-Common and Low-Resource Languages. In the 2018 International Conference on Asian Language Processing (IALP 2018).


ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to my lecturers at the university, and especially to my supervisors, Assoc. Prof. Nguyen Phuong Thai, Dr. Nguyen Van Vinh and MSc. Vu Huy Hien. They are my inspiration, guiding me over many obstacles in the completion of this thesis.

I am grateful to my family. They always encourage, motivate and create the best conditions for me to accomplish this thesis.

I would also like to thank my brother, Nguyen Minh Thong, and my friends, Tran Minh Luyen and Hoang Cong Tuan Anh, for giving me much useful advice and for supporting my thesis, my studies and my life.

Finally, I sincerely acknowledge Vietnam National University, Hanoi, and especially the TC.02-2016-03 project, named “Building a machine translation system to support translation of documents between Vietnamese and Japanese to help managers and businesses in Hanoi approach the Japanese market”, for financially supporting my master’s study.


Table of Contents

2 Literature review 4

2.1 Machine Translation 4

2.1.1 History 4

2.1.2 Approaches 5

2.1.3 Evaluation 7

2.1.4 Open-Source Machine Translation 8

2.1.4.1 Moses - an Open Statistical Machine Translation System 9

2.1.4.2 OpenNMT - an Open Neural Machine Translation System 10

2.2 Word Embedding 11

2.2.1 Monolingual Word Embedding Models 12

2.2.2 Cross-Lingual Word Embedding Models 13

3 Using Cross-Lingual Word Embedding Models for Machine Translation Systems 17

3.1 Enhancing the quality of Phrase-table in SMT Using Cross-Lingual Word Embedding 17

3.1.1 Recomputing Phrase-table weights 18

3.1.2 Generating new phrase pairs 19

3.2 Addressing the Unknown Word Problem in NMT Using Cross-Lingual Word Embedding Models 21

4 Experiments and Results 27

4.1 Settings 27

4.2 Results 31


4.2.1 Word Translation Task 31

4.2.2 Impact of Enriching the Phrase-table on SMT system 32

4.2.3 Impact of Removing the Unknown Words on NMT system 35


List of Figures

2.1 The CBOW model predicts the current word based on the context, and the Skip-gram predicts surrounding words based on the current word 13

2.2 Toy illustration of the cross-lingual embedding model 14

3.1 Flow of training phase 22

3.2 Flow of testing phase 23

3.3 Example in testing phase 25


List of Tables

3.1 The sample of new phrase pairs generated by using projections of word vector representations 21

4.1 Monolingual corpora 28

4.2 Bilingual corpora 28

4.3 Bilingual dictionaries 29

4.4 The precision of word translation retrieval top-k nearest neighbors in Vietnamese-English and Japanese-Vietnamese language pairs 32

4.5 Results on UET and TED dataset in the PBSMT system for Vietnamese-English and Japanese-Vietnamese respectively 33

4.6 Translation examples of the PBSMT in Vietnamese-English 34

4.7 Results of removing unknown words on UET and TED dataset in the NMT system for Vietnamese-English and Japanese-Vietnamese respectively 35

4.8 Translation examples of the NMT system in Vietnamese-English 37


List of Abbreviations

MT Machine Translation

SMT Statistical Machine Translation

PBSMT Phrase-based Statistical Machine Translation

NMT Neural Machine Translation

NLP Natural Language Processing

RNN Recurrent Neural Network

CNN Convolutional Neural Network

UNMT Unsupervised Neural Machine Translation


Chapter 1

Introduction

Machine Translation (MT) is a sub-field of computational linguistics. It is automated translation, which translates text or speech from one natural language to another by using computer software. Nowadays, machine translation systems attain much success in practice, and two approaches that have been widely used for MT are Phrase-based Statistical Machine Translation (PBSMT) and Neural Machine Translation (NMT). In the PBSMT system, the core component is the phrase-table, which contains the words and phrases the SMT system uses to translate. In the translation process, sentences are split into distinct parts, as shown in (Koehn et al., 2007) (Koehn, 2010). At each step, for a given source phrase, the system tries to find the best candidate amongst many target phrases as its translation, based mainly on the phrase-table. Hence, having a good phrase-table can improve the quality of translation. However, attaining a rich phrase-table is a challenge, since the phrase-table is extracted and trained from large amounts of bilingual corpora, which require much effort and financial support, especially for less-common languages such as Vietnamese, Lao, etc. In the NMT system, the two main components are the encoder and the decoder. The encoder component uses a neural network, such as a recurrent neural network (RNN), to encode the source sentence, and the decoder component also uses a neural network to predict words in the target language. Some NMT models incorporate attention mechanisms to improve the translation quality.

To reduce the computational complexity, conventional NMT systems often limit their vocabularies to the top 30K-80K most frequent words in the source and target languages, and all words outside the vocabulary, called unknown words, are replaced with a single unk symbol. This approach leads to the inability to generate


of all translation systems.

(Cui et al., 2013) utilized techniques of pivot languages to enrich their phrase-table. Their phrase-table is made of source-pivot and pivot-target phrase-tables. As a result of this combination, they attained a significant improvement in translation. Similarly, (Zhu et al., 2014) used a method based on pivot languages to calculate the translation probabilities of source-target phrase pairs and achieved a slight enhancement. Unfortunately, the methods based on pivot languages cannot be applied to Vietnamese because of the less-common nature of this language. (Vogel and Monson, 2004) improved the translation quality by using phrase pairs from an augmented dictionary. They first augmented the dictionary using simple morphological variations and then assigned probabilities to the entries of this dictionary by using co-occurrence frequencies collected from bilingual data. However, their method needs a lot of bilingual corpora to accurately estimate the probabilities for dictionary entries, which are not available for low-resource languages.

In order to address the unknown word problem in the NMT system, (Luong et al., 2015b) annotated the training bilingual corpus with explicit alignment information that allows the NMT system to emit, for each unknown word in the target sentence, the position of its corresponding word in the source sentence. This information is then used in a post-processing step to translate every unknown word by using a bilingual dictionary. The method showed a substantial improvement of up to 2.8 BLEU points over various NMT systems on the WMT’14 English-French translation task. However, having a good dictionary, which is utilized in the post-processing step, is also costly and time-consuming.

(Sennrich et al., 2016) introduced a simple approach to handle the translation of unknown words in NMT by encoding unknown words as a sequence of subword units. This method is based on the intuition that a variety of word classes are translated via smaller units than words. For example, names are translated by character copying or transliteration, compounds are translated via compositional translation, etc. The approach indicated an improvement of up to 1.3 BLEU over a back-off dictionary baseline model on the WMT 15 English-Russian translation task.

(Li et al., 2016) proposed a novel substitution-translation-restoration method to tackle the NMT unknown word problem. In this method, the substitution step replaces the unknown words in a testing sentence with similar in-vocabulary words based on a similarity model learned from monolingual data. The translation step then translates the testing sentence with a model trained on bilingual data in which unknown words have been replaced. Finally, the restoration step substitutes the translations of the replaced words with those of the original ones. This method demonstrated a significant improvement of up to 4 BLEU points over the attention-based NMT on Chinese-to-English translation.

Recently, techniques using word embedding have received much interest from natural language processing communities. Word embedding is a vector representation of words which conserves semantic information and their context words. Additionally, we can exploit the advantage of embedding to represent words in diverse distinct spaces, as shown in (Mikolov et al., 2013b). Besides, cross-lingual word embedding models, which learn cross-lingual representations of words in a joint embedding space to represent meaning and transfer knowledge in cross-lingual scenarios, are also receiving a lot of interest. Inspired by the advantages of cross-lingual embedding models and the work of (Mikolov et al., 2013b) and (Li et al., 2016), we propose a model to enhance the quality of a phrase-table by recomputing the phrase weights and generating new phrase pairs for the phrase-table, and a model to address the unknown word problem in the NMT system by replacing the unknown words with the most appropriate in-vocabulary words.

The rest of this thesis is organized as follows. Chapter 2 gives an overview of the related background. In Chapter 3, we describe our two proposed models: one model enhances the quality of the phrase-table in SMT, and the remaining model tackles the unknown word problem in NMT. Settings and results of our experiments are shown in Chapter 4. We present our conclusions and future work in Chapter 5.


Chapter 2

Literature review

In this chapter, we give an overview of Machine Translation (MT) research and Word Embedding models in Sections 2.1 and 2.2 respectively. Section 2.1 covers the history, approaches, evaluation and open-source systems in MT. In Section 2.2, we introduce an overview of Word Embedding, including Monolingual and Cross-Lingual Word Embedding models.

2.1 Machine Translation

2.1.1 History

In the mid-1930s, Georges Artsrouni attempted to build “translation machines” by using paper tape to create an automatic dictionary. After that, Peter Troyanskii proposed a model including a bilingual dictionary and a method for handling grammatical issues between languages based on Esperanto’s grammatical system.

On January 7th, 1954, at the head office of IBM in New York, the first machine translation system was demonstrated in the Georgetown-IBM experiment. It automatically translated 60 sentences from Russian to English for the first time and opened a race for machine translation in many countries, such as Canada, Germany, and Japan. However, in 1966, the Automatic Language Processing Advisory Committee


(ALPAC) reported that the ten-year-long research had failed to fulfill expectations (Vogel et al., 1996). During the 1980s, a lot of MT activity took place, especially in Japan. At this time, research in MT typically depended on translation through a variety of intermediary linguistic representations, including syntactic, morphological, and semantic analysis. At the end of the 1980s, as computational power increased and became less expensive, more research was attempted on the statistical approach to MT.

During the 2000s, research in MT saw major changes. A lot of research focused on example-based machine translation and statistical machine translation (SMT). Besides, researchers also became more interested in hybridization, combining morphological and syntactic knowledge into statistical systems, as well as combining statistics with existing rule-based systems. Recently, the hot trend in MT has been the use of large artificial neural networks, called Neural Machine Translation (NMT). In 2014, (Cho et al., 2014) published the first paper on using neural networks in MT, followed by a lot of research in the following few years. Apart from research on bilingual machine translation systems, in 2018 researchers paid much attention to unsupervised neural machine translation (UNMT), which uses only monolingual data to train the MT system.


2.1.2 Approaches

Statistical

A Statistical Machine Translation (SMT) system uses statistical models to generate translations based on bilingual and monolingual corpora. The basic idea of SMT comes from information theory. A sentence f in the source language is translated into the sentence e in the target language based on the probability distribution p(e|f).

A simple way to model the probability distribution p(e|f) is to apply Bayes' Theorem:

p(e|f) ∝ p(f|e) p(e)

where p(f|e) is the translation model, which estimates the probability of the source sentence f given the target sentence e, and p(e) is the language model, which is the probability of seeing sentence e in the target language. Therefore, finding the best translation ê is done by maximizing the product p(f|e) p(e):

ê = argmax_e p(f|e) p(e)
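For illustration, the following Python sketch applies this decision rule to a toy candidate set. The probabilities are hypothetical numbers, not scores from any real translation or language model; they only show how the two models combine in log space.

```python
import math

# Hypothetical model scores for one source sentence (toy values, for illustration only).
translation_model = {  # p(f | e): how well each candidate explains the source sentence
    "i love you": 0.40,
    "i like you": 0.35,
    "me love you": 0.45,
}
language_model = {     # p(e): fluency of each candidate in the target language
    "i love you": 0.30,
    "i like you": 0.25,
    "me love you": 0.02,
}

def best_translation(candidates):
    """Return argmax_e p(f|e) * p(e), computed in log space for numerical stability."""
    return max(
        candidates,
        key=lambda e: math.log(translation_model[e]) + math.log(language_model[e]),
    )

print(best_translation(translation_model.keys()))  # -> "i love you"
```

Note that "me love you" has the highest translation-model score but a very low language-model score, so the product prefers the fluent candidate.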

Example-based

In an Example-based Machine Translation (EBMT) system, a sentence is translated by using the idea of analogy. In this approach, the corpus that is used is a large collection of existing translation pairs of source and target sentences. Given a new source sentence to be translated, the corpus is searched to select the sentences that contain similar sub-sentential parts. Then, the similar sentences are used to translate the sub-sentential parts of the original source sentence into the target language, and these parts are put together to generate a complete translation.

Neural Machine Translation

Neural Machine Translation (NMT) is the newest approach to MT and is based on machine learning. This approach uses a large artificial neural network to predict the likelihood of a sequence of words, typically encoding whole sentences in a single integrated model. The structure of NMT models is simpler than that of SMT models: they use vector representations (“embeddings”, “continuous space representations”) for words and internal states, and contain a single sequence model that predicts one word at a time. There is no separate translation model, language model, or reordering model. The first NMT models used recurrent neural networks (RNN): a bidirectional RNN, known as the encoder, encodes the source sentence, and a second RNN, known as the decoder, predicts words in the target language. NMT systems can continuously learn and be adjusted to generate the best output, but they require a lot of computing power. This is why these models have only been developed strongly in recent years.

2.1.3 Evaluation

Machine Translation evaluation is essential to examine the quality of an MT system or to compare different MT systems. The simplest method to evaluate MT output is to use human judges. However, human evaluation is costly and time-consuming, and thus unsuitable for frequent development and research on an MT system. Therefore, various automatic methods have been studied to evaluate translation quality, such as Word Error Rate (WER), Position-independent word Error Rate (PER), the NIST score (Doddington, 2002), the BLEU score (Papineni et al., 2002), etc. In our work, we use BLEU to automatically evaluate our MT system configurations. BLEU is a popular method for automatically evaluating MT output that is quick, inexpensive, and language-independent, as shown in (Papineni et al., 2002). The basic idea of this method is to compare n-grams of the MT output with n-grams of the reference translation and count the number of matches: the more matches, the better the MT output is. The BLEU score combines the n-gram precisions p_n with weights w_n (and a brevity penalty BP penalizing overly short output):

BLEU = BP · exp( Σ_{n=1}^{N} w_n · log p_n )

The BLEU n-gram precisions p_n are computed by summing the n-gram matches over all the candidate sentences in the test corpus C:

p_n = ( Σ_{C ∈ {Candidates}} Σ_{ngram ∈ C} Count_matched(ngram) ) / ( Σ_{C' ∈ {Candidates}} Σ_{ngram' ∈ C'} Count(ngram') )

where n is the order of the n-grams considered for p_n and w_n are the weights assigned to the n-gram precisions. In the baseline, N = 4 and the weights are uniformly distributed.
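As a rough illustration of these quantities, the following sketch computes clipped n-gram precisions and a brevity penalty for a single sentence pair. It is a simplified, sentence-level version: the corpus-level score above sums the counts over all candidate sentences before dividing, and real BLEU implementations (e.g. in Moses or sacreBLEU) handle zero counts and tokenization more carefully.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU sketch: clipped n-gram precisions plus a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = Counter(ngrams(cand, n)), Counter(ngrams(ref, n))
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # The tiny floor avoids log(0) in this toy sketch; real implementations smooth differently.
        log_precisions.append(math.log(max(matched, 1e-9) / total))
    score = math.exp(sum(log_precisions) / max_n)          # uniform weights w_n = 1/N
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity_penalty * score

print(bleu("the cat sat on the mat", "the cat is on the mat"))
```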

2.1.4 Open-Source Machine Translation

In order to stimulate the development of the MT research community, a variety of free and complete toolkits for MT are provided. With the statistical (or data-driven) approach to MT, we can consider the following systems:

• Moses1: a complete SMT system

• UCAM-SMT2: the Cambridge SMT system

• Phrasal3: a toolkit for phrase-based SMT

• Joshua4: a decoder for syntax-based SMT

• Pharaoh5: a decoder for IBM Model 4

Besides, because of the superiority of NMT over SMT, NMT has received much attention from researchers and companies. The following state-of-the-art NMT systems are totally free and easy to set up:

• OpenNMT6: a system designed to be simple to use and easy to extend, developed by Harvard University and SYSTRAN

• Google-GNMT7: a competitive sequence-to-sequence model developed by Google

• Facebook-fairseq8: a system implemented with Convolutional Neural Networks (CNN), which can achieve performance similar to RNN-based NMT while running nine times faster, developed by Facebook AI Research

• Amazon-Sockeye9: a sequence-to-sequence framework based on Apache MXNet, developed by Amazon

In this part, we introduce the two MT systems used in our work. The first system is Moses, an open system for SMT, and the remaining system is OpenNMT, an open system for NMT.

2.1.4.1 Moses - an Open Statistical Machine Translation System

Moses, which was introduced by (Koehn et al., 2007), is a complete open-source toolkit for statistical machine translation. It can automatically train translation models for any language pair from a collection of translated sentences (parallel data). Given the trained model, an efficient search algorithm is used to quickly find the highest-probability translation among an exponential number of candidates.

There are two main components in Moses: the training pipeline and the decoder. The training pipeline contains a variety of tools which take the parallel data and train it into a translation model. Firstly, the data needs to be cleaned by inserting spaces between words and punctuation (tokenisation), removing long and empty sentences, etc. Secondly, external tools such as GIZA++ (Och and Ney, 2003) or MGIZA++ are used for word alignment. These word alignments are then used to extract phrase translation pairs or hierarchical rules, and these phrase pairs or rules are then scored by using corpus-wide statistics. Finally, the weights of the different statistical models are tuned to generate the best possible translations; MERT (Och, 2003) is used to tune the weights in Moses. In the decoding process, Moses uses the trained translation model to translate the source sentence into the target sentence. To overcome the huge search problem in decoding, Moses implements several different search algorithms, such as stack-based search, cube-pruning, chart parsing, etc. Besides, an important part of the decoder is the language model, which is trained from monolingual data in the target language to ensure the fluency of the output. Moses supports many kinds of language model tools, such as KenLM (Heafield, 2011), SRILM (Stolcke, 2002), IRSTLM (Federico et al., 2008), etc.

8 https://github.com/facebookresearch/fairseq

9 https://github.com/awslabs/sockeye


2.1.4.2 OpenNMT - an Open Neural Machine Translation System

OpenNMT is distributed in three main implementations:

• OpenNMT-lua: the original project, developed with LuaTorch, ready for quick experiments and production

• OpenNMT-py: a clone of OpenNMT-lua which uses the more modern PyTorch; it is easy to extend and especially suited for research

• OpenNMT-tf: a general-purpose sequence modeling tool in TensorFlow, focusing on large-scale experiments and high-performance models

The structure of the Neural Machine Translation system in OpenNMT is typically implemented as an encoder-decoder architecture (Bahdanau et al., 2014). The encoder is a recurrent neural network (RNN) or a bidirectional recurrent neural network that encodes a source sentence x = {x_1, ..., x_Tc} into a sequence of hidden states h = {h_1, ..., h_Tc}:

h_t = f_enc(e(x_t), h_{t−1})    (2.4)

where h_t is the hidden state at time step t, e(x_t) is the embedding of x_t, Tc is the number of symbols in the source sentence, and the function f_enc is a recurrent unit such as the gated recurrent unit (GRU) or the long short-term memory (LSTM) unit. The decoder is also a recurrent neural network, which is trained to predict the conditional probability of each symbol y_t given its preceding symbols y_<t and the context vector c_t:

P(y_t | y_<t) = g(e(y_{t−1}), r_{t−1}, c_t)    (2.5)


r_t = f_dec(e(y_t), r_{t−1}, c_t)    (2.6)

where r_t is the hidden state of the decoder at time step t, updated by f_dec, e(y_t) is the embedding of the target symbol y_t, and g is a nonlinear function that computes the probability of y_t. In each decoding step, the context vector c_t is computed as a weighted sum of the source hidden states:

c_t = Σ_{j=1}^{Tc} α_tj h_j

where α_tj is the attention weight of source hidden state h_j at decoding step t.
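The following PyTorch sketch puts Equations (2.4)-(2.6) together for a single decoding step. It is a minimal illustration, not OpenNMT's actual implementation: the vocabulary sizes and dimensions are toy values, and the attention-weighted context c_t is replaced by a simple mean over the encoder states to keep the example short.

```python
import torch
import torch.nn as nn

# Toy dimensions (assumptions for illustration, not OpenNMT defaults).
vocab_src, vocab_tgt, emb_dim, hid_dim = 1000, 1000, 64, 128

src_embed = nn.Embedding(vocab_src, emb_dim)           # e(x_t)
tgt_embed = nn.Embedding(vocab_tgt, emb_dim)           # e(y_t)
encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)   # f_enc, Eq. (2.4)
decoder_cell = nn.GRUCell(emb_dim + hid_dim, hid_dim)  # f_dec, Eq. (2.6)
output_layer = nn.Linear(hid_dim, vocab_tgt)           # g, Eq. (2.5)

def translate_step(src_ids, prev_tgt_id, prev_state):
    """One decoding step: encode the source, then predict the next target word."""
    h, _ = encoder(src_embed(src_ids))             # hidden states h_1..h_Tc
    context = h.mean(dim=1)                        # simple stand-in for the attention sum c_t
    dec_in = torch.cat([tgt_embed(prev_tgt_id), context], dim=-1)
    state = decoder_cell(dec_in, prev_state)       # r_t
    probs = torch.softmax(output_layer(state), dim=-1)  # P(y_t | y_<t, c_t)
    return probs, state

src = torch.randint(0, vocab_src, (1, 7))          # a batch with one source sentence
probs, state = translate_step(src, torch.tensor([1]), torch.zeros(1, hid_dim))
print(probs.shape)  # torch.Size([1, 1000])
```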

2.2 Word Embedding

Word embedding is a vector representation of words which conserves semantic information and their context words (Huang et al., 2012) (Mikolov et al., 2013a) (Mikolov et al., 2013b). Additionally, we can exploit the advantage of embedding to represent words in diverse distinct spaces, as shown in (Mikolov et al., 2013b). Besides, applying word embedding to multilingual applications is also receiving a lot of interest. Therefore, learning cross-lingual embedding models, which learn cross-lingual representations of words in a joint embedding space, to represent meaning and transfer knowledge in cross-lingual scenarios is necessary. In this section, we introduce models for monolingual and cross-lingual word embedding.


2.2.1 Monolingual Word Embedding Models

During the 1990s, vector space models were applied to distributional semantics. A variety of models were then developed for estimating continuous representations of words, such as Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), etc. The term word embeddings was first used by (Bengio et al., 2003), who learned word representations using a feed-forward neural network. Recently, (Mikolov et al., 2013a) proposed new models for effectively learning distributed representations of words using a feed-forward neural network, known as word2vec. They provided two neural networks for learning word vectors: Continuous Skip-gram and Continuous Bag-of-Words (CBOW). In CBOW, a feed-forward neural network with an input layer, a projection layer, and an output layer is used to predict the current word based on its context words, as shown in Figure 2.1. In this architecture, the projection layer is shared among all words; the input is a window of n future words and n history words of the current word. All the input words are projected to a common space, and the current word is then predicted by averaging these input vectors. In contrast to CBOW, the Skip-gram model uses the current word to predict the surrounding words, as shown in Figure 2.1. The input of this model is a center word, which is fed into the projection layer, and the output is 2 * n vectors for the n history and n future words. In practice, in the case of limited monolingual data, Skip-gram yields better word representations than CBOW. However, CBOW is faster and is suggested for larger datasets.
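For illustration, both architectures can be trained with the word2vec implementation in gensim; the sketch below uses a toy tokenized corpus and gensim ≥ 4 parameter names (real experiments would use large monolingual corpora and larger windows and dimensions).

```python
from gensim.models import Word2Vec

# A toy tokenized corpus; real experiments use large monolingual corpora.
sentences = [
    ["machine", "translation", "needs", "bilingual", "data"],
    ["word", "embedding", "models", "use", "monolingual", "data"],
    ["monolingual", "data", "are", "available", "for", "most", "languages"],
]

# sg=1 selects Skip-gram; sg=0 selects CBOW.
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=50)

print(skipgram.wv["data"].shape)              # (50,)
print(skipgram.wv.most_similar("data", topn=3))
```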

A year later, (Pennington et al., 2014) introduced Global Vectors (GloVe), a competitive set of pre-trained embeddings. GloVe learns representations of words through matrix factorization. GloVe proposes a weighted least squares objective L_GloVe, which minimizes the difference between the dot product of the embedding of a word w_i and its context word w_j and the logarithm of their number of co-occurrences:

L_GloVe = Σ_{i,j=1}^{|V|} f(X_ij) (w_i^T w̃_j + b_i + b̃_j − log X_ij)²

where X_ij is the number of co-occurrences of w_i and w_j, b_i and b̃_j are bias terms, and f is a weighting function that down-weights rare co-occurrences.
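The numpy sketch below evaluates this objective on toy data. The weighting function f is the one proposed in the GloVe paper (an assumption here, since only the least-squares term is described above), and the vectors and counts are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 20, 8                                       # toy vocabulary size and embedding dimension
X = rng.poisson(1.0, size=(V, V)).astype(float)    # toy co-occurrence counts X_ij
W = rng.normal(size=(V, d))                        # word vectors w_i
W_ctx = rng.normal(size=(V, d))                    # context vectors w~_j
b = rng.normal(size=V)                             # word biases b_i
b_ctx = rng.normal(size=V)                         # context biases b~_j

def f(x, x_max=100.0, alpha=0.75):
    """Weighting function from the GloVe paper (assumed here)."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss():
    i, j = np.nonzero(X)                           # the sum runs over observed co-occurrences
    dots = np.sum(W[i] * W_ctx[j], axis=1)
    residual = dots + b[i] + b_ctx[j] - np.log(X[i, j])
    return np.sum(f(X[i, j]) * residual ** 2)

print(glove_loss())
```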


Figure 2.1: The CBOW model predicts the current word based on the context, and the Skip-gram predicts surrounding words based on the current word.

2.2.2 Cross-Lingual Word Embedding Models

Cross-lingual word embedding models learn cross-lingual representations of words in a joint embedding space to represent meaning and transfer knowledge in cross-lingual applications. Recently, many models for learning cross-lingual embeddings have been proposed, as shown in (Ruder et al., 2017), a survey of cross-lingual word embedding models. In this section, we introduce the three models of (Mikolov et al., 2013b), (Xing et al., 2015) and (Conneau et al., 2017), which are used in our experiments to enhance the quality of the MT system. These models always assume that two sets of embeddings have been trained independently on monolingual data, and their work focuses on learning a mapping between the two sets such that translations are close in the shared space.

Cross-lingual embedding model in (Mikolov et al., 2013b)

(Mikolov et al., 2013b) show that they can exploit the similarities of monolingual embedding spaces by learning a linear projection between the vector spaces representing each language. They first build vector representation models of the languages using large amounts of monolingual data. Next, they use a small bilingual dictionary to learn a linear projection between the languages. For this purpose, they use a dictionary of n = 5000 word pairs {x_i, z_i}, i ∈ {1, ..., n}, to find a transformation matrix W such that W x_i approximates z_i. In practice, learning the transformation matrix W can


be considered as an optimization problem, which can be solved by minimizing the following error function using a gradient descent method:

min_W Σ_{i=1}^{n} ||W x_i − z_i||²    (2.11)

At test time, given a new word and its continuous vector representation x, they can map it to the other language space by computing z = W x. The word whose representation is closest to z in the target language space is then retrieved, using cosine similarity as the distance metric.
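A compact numpy sketch of this procedure is shown below on synthetic data. It fits W in closed form with least squares rather than the gradient descent mentioned above (both minimize the same objective in Equation 2.11), and then retrieves the nearest target word by cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 100                                    # dictionary size and embedding dimension
X = rng.normal(size=(n, d))                        # source embeddings x_i of the dictionary pairs
W_true = rng.normal(size=(d, d)) / np.sqrt(d)      # hidden linear relation (toy data only)
Z = X @ W_true + 0.01 * rng.normal(size=(n, d))    # target embeddings z_i of the dictionary pairs

# Minimize sum_i ||W x_i - z_i||^2; a closed-form least-squares fit replaces gradient descent.
W, *_ = np.linalg.lstsq(X, Z, rcond=None)          # convention: row vectors, x @ W ~ z

def translate(x, target_matrix):
    """Map a source vector with W and return the index of the closest target word (cosine)."""
    z = x @ W
    sims = target_matrix @ z / (np.linalg.norm(target_matrix, axis=1) * np.linalg.norm(z) + 1e-9)
    return int(np.argmax(sims))

print(translate(X[0], Z))                          # expected to retrieve the paired entry, index 0
```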

Cross-lingual embedding model in (Xing et al., 2015)

Inspired by the work of (Mikolov et al., 2013b), (Xing et al., 2015) pointed out that the Euclidean distance in the objective function of Equation 2.11 is fundamentally different from the cosine distance, which is used to measure the ‘closeness’ of words in the projection space, and hence causes an inconsistency. They solved this problem by enforcing an orthogonality constraint on W. Equation 2.11 then becomes the Procrustes problem of (Schönemann, 1966), which has a solution obtained from the singular value decomposition (SVD) of ZX^T, where X and Z are two matrices of size d × n containing the embeddings of the words in the bilingual dictionary. The formula is shown as follows:

W* = argmin_W ||W X − Z||² = U V^T,  with  U Σ V^T = SVD(Z X^T)    (2.12)
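On toy data, the closed-form solution of Equation 2.12 is only a few lines of numpy; the sketch below generates a hidden orthogonal map, recovers it from noisy dictionary pairs, and checks the orthogonality of the result.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 50, 1000                                    # embedding dimension, dictionary size
X = rng.normal(size=(d, n))                        # source dictionary embeddings as columns
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))       # a hidden orthogonal map (toy data only)
Z = Q @ X + 0.01 * rng.normal(size=(d, n))         # target dictionary embeddings as columns

# Procrustes solution of Eq. (2.12): W* = U V^T with U S V^T = SVD(Z X^T).
U, _, Vt = np.linalg.svd(Z @ X.T)
W = U @ Vt

print(np.allclose(W @ W.T, np.eye(d), atol=1e-6))  # W is orthogonal
print(np.linalg.norm(W - Q) / np.linalg.norm(Q))   # close to the hidden map on toy data
```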

Cross-lingual embedding model in (Conneau et al., 2017)

The two models above reported good performance on the word translation task by using a small bilingual dictionary to learn the linear mapping. In this model, the authors show how to learn the mapping W without using any bilingual data; their model even outperforms existing supervised methods on cross-lingual tasks for some pairs of languages. An illustration of this model is shown in Figure 2.2.

Figure 2.2: Toy illustration of the cross-lingual embedding model


In the illustration, (A) shows two sets of pre-trained word embeddings, English words denoted by X and Italian words denoted by Y; each dot indicates a word in that space, and the size of the dot is proportional to the frequency of the word in the training corpus of that language. (B) introduces a method to learn an initial proxy of W by using an adversarial criterion; the stars are randomly selected words that are fed to the discriminator. (C) refines the mapping W via Procrustes, using the best-matched words as anchor points. (D) changes the metric of the space to improve performance on less frequent words. The details of this model are described as follows.

To learn W without using bilingual data, the authors use a domain-adversarial approach. Let X = {x_1, ..., x_n} and Y = {y_1, ..., y_m} be two sets of n and m word embeddings of a source and a target language respectively. A model called the discriminator is trained to discriminate between elements randomly sampled from WX and Y, while W is trained to prevent the discriminator from making accurate predictions. The discriminator loss is shown below:

L_D(θ_D | W) = − (1/n) Σ_{i=1}^{n} log P_{θ_D}(source = 1 | W x_i) − (1/m) Σ_{i=1}^{m} log P_{θ_D}(source = 0 | y_i)    (2.13)

where θ_D denotes the discriminator parameters. The probability P_{θ_D}(source = 1 | z) is the probability, according to the discriminator, that a vector z is the mapping of a source embedding.

The mapping W is trained with the following loss function:

L_W(W | θ_D) = − (1/n) Σ_{i=1}^{n} log P_{θ_D}(source = 0 | W x_i) − (1/m) Σ_{i=1}^{m} log P_{θ_D}(source = 1 | y_i)    (2.14)
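The PyTorch sketch below evaluates the two losses for one batch. It is a simplified illustration rather than the original implementation: the discriminator architecture and dimensions are toy choices, and in actual training L_D and L_W are minimized in alternation, each with respect to its own parameters (θ_D and W respectively).

```python
import torch
import torch.nn as nn

d = 64                                           # embedding dimension (toy value)
n_src, n_tgt = 32, 32                            # words sampled per step from each language
W = nn.Linear(d, d, bias=False)                  # the mapping W
discriminator = nn.Sequential(                   # predicts P(source = 1 | vector)
    nn.Linear(d, 128), nn.LeakyReLU(), nn.Linear(128, 1), nn.Sigmoid()
)

x = torch.randn(n_src, d)                        # sampled source embeddings x_i
y = torch.randn(n_tgt, d)                        # sampled target embeddings y_i

p_wx = discriminator(W(x)).clamp(1e-6, 1 - 1e-6).squeeze(-1)
p_y = discriminator(y).clamp(1e-6, 1 - 1e-6).squeeze(-1)

# Eq. (2.13): the discriminator tries to label W x_i as source (1) and y_i as target (0).
loss_discriminator = -(torch.log(p_wx).mean() + torch.log(1 - p_y).mean())

# Eq. (2.14): the mapping tries to fool the discriminator, i.e. the labels are flipped.
loss_mapping = -(torch.log(1 - p_wx).mean() + torch.log(p_y).mean())

print(loss_discriminator.item(), loss_mapping.item())
```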

In a refinement step, the best-matched word pairs obtained from this adversarial training are then used as anchor points to re-estimate W with the Procrustes solution, yielding a more accurate dictionary.

To increase the similarity assigned to isolated word vectors and decrease that of vectors lying in dense areas, a similarity measure named Cross-domain Similarity Local Scaling (CSLS) is proposed. This measure computes the similarity between a mapped source embedding W x_s and a target embedding y_t as

CSLS(W x_s, y_t) = 2 cos(W x_s, y_t) − r_T(W x_s) − r_S(y_t)

where r_T(W x_s) is the mean cosine similarity between W x_s and its K nearest neighbors in the target space:


r_T(W x_s) = (1/K) Σ_{y_t ∈ N_T(W x_s)} cos(W x_s, y_t)    (2.16)

where N_T(W x_s) is the set of the K nearest neighbors of the mapped source word in the target space, and r_S(y_t) is defined analogously over the mapped source space.
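A brute-force numpy sketch of CSLS-based retrieval is shown below on random toy embeddings: it computes the full cosine matrix, averages the K largest similarities on each side to obtain r_T and r_S, and then retrieves translations with the adjusted scores.

```python
import numpy as np

rng = np.random.default_rng(2)
n_src, n_tgt, d, K = 200, 300, 50, 10
WX = rng.normal(size=(n_src, d))                  # mapped source embeddings W x_s (rows)
Y = rng.normal(size=(n_tgt, d))                   # target embeddings y_t (rows)

def normalize(M):
    return M / (np.linalg.norm(M, axis=1, keepdims=True) + 1e-9)

cos = normalize(WX) @ normalize(Y).T              # cos(W x_s, y_t) for all pairs

def mean_topk(sims, k):
    """Average similarity to the K nearest neighbors along the last axis (r_T / r_S)."""
    top = np.sort(sims, axis=-1)[..., -k:]
    return top.mean(axis=-1)

r_T = mean_topk(cos, K)                           # r_T(W x_s), Eq. (2.16); shape (n_src,)
r_S = mean_topk(cos.T, K)                         # r_S(y_t), neighborhood density on the target side

csls = 2 * cos - r_T[:, None] - r_S[None, :]      # CSLS scores for all source-target pairs
translation = csls.argmax(axis=1)                 # CSLS-based retrieval for every source word
print(translation[:5])
```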
