Active Sample Selection for Named Entity Transliteration
Dan Goldwasser and Dan Roth
Department of Computer Science
University of Illinois
Urbana, IL 61801
{goldwas1,danr}@uiuc.edu
Abstract

This paper introduces a new method for identifying named-entity (NE) transliterations within bilingual corpora. Current state-of-the-art approaches usually require annotated data and relevant linguistic knowledge, which may not be available for all languages. We show how to effectively train an accurate transliteration classifier using very little data, obtained automatically. To perform this task, we introduce a new active sampling paradigm for guiding and adapting the sample selection process. We also investigate how to improve the classifier by identifying repeated patterns in the training data. We evaluated our approach using English, Russian and Hebrew corpora.
1 Introduction
This paper presents a new approach for constructing a discriminative transliteration model. Our approach is fully automated and requires little knowledge of the source and target languages.

Named entity (NE) transliteration is the process of transcribing a NE from a source language to a target language based on phonetic similarity between the entities. Figure 1 provides examples of NE transliterations in English, Russian and Hebrew.

Figure 1: NE in English, Russian and Hebrew.

Identifying transliteration pairs is an important component in many linguistic applications, such as machine translation and information retrieval, which require identifying out-of-vocabulary words.

In our settings, we have access to source language NEs and the ability to label the data upon request. We introduce a new active sampling paradigm that aims to guide the learner toward informative samples, allowing learning from a small number of representative examples. After the data is obtained, it is analyzed to identify repeating patterns which can be used to focus the training process of the model.

Previous works usually take a generative approach (Knight and Graehl, 1997). Other approaches exploit similarities in aligned bilingual corpora; for example, (Tao et al., 2006) combine two unsupervised methods, and (Klementiev and Roth, 2006) bootstrap with a classifier used interchangeably with an unsupervised temporal alignment method. Although these approaches alleviate the problem of obtaining annotated data, other resources are still required, such as a large aligned bilingual corpus.
The idea of selectively sampling training examples has been widely discussed in machine learning theory (Seung et al., 1992) and has been applied successfully to several NLP applications (McCallum and Nigam, 1998). Unlike other approaches, our approach is based on minimizing the distance between the feature distribution of a comprehensive reference set and that of the sampled set.
2 Training a Transliteration Model
Our framework works in several stages, as summarized in Algorithm 1. First, a training set consisting of NE transliteration pairs $(w_s, w_t)$ is automatically generated using an active sample selection scheme. The sample selection process is guided by the Sufficiently Spanning Features (SSF) criterion introduced in section 2.2, which identifies informative samples in the source language. An oracle capable of pairing a NE in the source language with its counterpart in the target language is then used. Negative training samples are generated by reshuffling the terms in these pairs (a minimal sketch is given after Algorithm 1).

Once the training data has been collected, it is analyzed to identify repeating patterns, which are used to focus the training process by assigning weights to features corresponding to the observed patterns. Finally, a linear model is trained using a variation of the averaged perceptron algorithm (Freund and Schapire, 1998). The remainder of this section provides details about these stages: section 2.1 describes the basic formulation of the transliteration model and the feature extraction scheme, section 2.2 describes the selective sampling process, and section 2.3 explains how learning is focused by using feature weights.
Input: bilingual comparable corpus $(S, T)$; set of named entities $NE_S$ from $S$; reference corpus $R_S$; transliteration oracle $O$; training corpora $D = D_S, D_T$
Output: transliteration model $M$

Guiding the Sampling Process:
1: repeat
2:   select a set $C \subseteq NE_S$ randomly
3:   $w_s = \arg\min_{w \in C} \mathrm{distance}(R, D_S \cup \{w\})$
4:   $D = D \cup \{(w_s, O(w_s))\}$
5: until $\mathrm{distance}(R, D_S \cup \{w_s\}) \ge \mathrm{distance}(R, D_S)$

Determining Feature Activation Strength:
6: define $W: f \to \mathbb{R}$ such that for each feature $f = (f_s, f_t)$,
     $W(f) = \frac{\#(f_s, f_t)}{\#(f_s)} \times \frac{\#(f_s, f_t)}{\#(f_t)}$
7: use $D$ to train $M$

Algorithm 1: Constructing a transliteration model.
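As a concrete illustration of the negative-sample step above, here is a minimal Python sketch. The paper only states that negatives are obtained by reshuffling the terms of the positive pairs, so the exact pairing policy below (re-pairing each source NE with the target side of every other positive pair) is our assumption, not the authors' code.

```python
import itertools

def generate_negatives(positive_pairs):
    """Build negative samples by re-pairing each source-language NE
    with the target side of every *other* positive pair.
    NOTE: this particular reshuffling policy is our assumption; the
    paper only says negatives come from reshuffling the pairs."""
    return [(src, tgt)
            for (src, _), (_, tgt) in itertools.permutations(positive_pairs, 2)]

# Example: two positive pairs yield two mismatched (negative) pairs.
pairs = [("putin", "путин"), ("moscow", "москва")]
print(generate_negatives(pairs))
# [('putin', 'москва'), ('moscow', 'путин')]
```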
2.1 Transliteration Model
Our transliteration model takes a discriminative approach; the classifier is presented with a word pair $(w_s, w_t)$, where $w_s$ is a named entity, and is asked to determine whether $w_t$ is a transliteration of the NE in the target language. We use a linear classifier trained with a regularized perceptron update rule (Grove and Roth, 2001), as implemented in SNoW (Roth, 1998). The classifier's confidence score is used for ranking positively tagged transliteration candidates.

Our initial feature extraction scheme follows the one presented in (Klementiev and Roth, 2006), in which the feature space consists of n-gram pairs from the two languages. Given a sample, each word is decomposed into a set of substrings of up to a given length (including the empty string). Features are generated by pairing substrings from the two sets whose relative positions in the original words differ by one place or less: first each word is decomposed into a set of substrings, then substrings from the two sets are coupled to complete the pair representation. Figure 2 depicts this process.

Figure 2: Feature extraction process.
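The following Python sketch illustrates our reading of this feature scheme: substrings of length up to two (plus the empty string) are indexed by their start position, and source/target substrings are paired when their positions differ by at most one. The exact position-alignment rule and the handling of the empty string are assumptions on our part, not the authors' code.

```python
def substrings(word, max_len=2):
    """All substrings of length 1..max_len with their start position,
    plus the empty string (given position -1 so it pairs with anything)."""
    subs = [("", -1)]
    for i in range(len(word)):
        for n in range(1, max_len + 1):
            if i + n <= len(word):
                subs.append((word[i:i + n], i))
    return subs

def extract_features(w_s, w_t, max_len=2):
    """Couple substrings from the two words whose relative positions
    differ by one place or less; returns a set of (source, target) pairs."""
    return {(s, t)
            for s, i in substrings(w_s, max_len)
            for t, j in substrings(w_t, max_len)
            if i < 0 or j < 0 or abs(i - j) <= 1}
```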
2.2 Guiding the Sampling Process with SSF

The initial step in our framework is to generate a training set of transliteration pairs; this is done by pairing highly informative source language candidate NEs with target language counterparts. We developed a criterion for adding new samples, Sufficiently Spanning Features (SSF), which quantifies the sampled set's ability to span the feature space. This is done by evaluating the L1 distance between the frequency distributions of source language word fragments in the current sampled set and in a comprehensive set of source language NEs, serving as reference. We argue that since the features used for learning are n-gram features, once these two distributions are close enough, our example space provides a good and concise characterization of all named entities we will ever need to consider. Special care should be given to choosing an appropriate reference; as a general guideline, the reference set should be representative of the testing data. We collected a set $R$, consisting
of 50,000 NEs by crawling through Wikipedia's articles and using an English NER system available at http://L2R.cs.uiuc.edu/~cogcomp. The frequency distribution was generated over all character-level bigrams appearing in the text, as bigrams best correlate with the way features are extracted. Given a reference text $R$, the n-gram distribution of $R$ is defined as

$$D_R(ng_i) = \frac{\#ng_i}{\sum_j \#ng_j},$$

where $ng_i$ is an n-gram in $R$ and $\#ng_i$ is its frequency. Given a sample set $S$, we measure the L1 distance between the distributions:

$$\mathrm{distance}(R, S) = \sum_{ng \in R} |D_R(ng) - D_S(ng)|.$$

Samples decreasing the distance between the distributions were added to the training data. Given a set $C$ of candidates for annotation, a sample $w_s \in C$ was added to the training set if

$$w_s = \arg\min_{w \in C} \mathrm{distance}(R, D_S \cup \{w\}).$$

A sample set is said to have SSF if the distance remains constant as more samples are added.
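A minimal sketch of this SSF-guided selection loop, under our own choices of data structures (bigram distributions as dicts, an `oracle` callable as described in section 2.2.1, and a fixed random candidate batch size):

```python
import random
from collections import Counter

def bigram_dist(words):
    """Character-bigram frequency distribution over a list of words."""
    counts = Counter(w[i:i + 2] for w in words for i in range(len(w) - 1))
    total = sum(counts.values())
    return {ng: c / total for ng, c in counts.items()} if total else {}

def distance(ref_dist, sample_words):
    """L1 distance, summed over the reference n-grams as in the text."""
    d = bigram_dist(sample_words)
    return sum(abs(p - d.get(ng, 0.0)) for ng, p in ref_dist.items())

def select_training_pairs(reference_nes, candidate_nes, oracle, batch=20):
    """Greedily pick the candidate that most reduces the distance to the
    reference distribution; stop once no candidate decreases it (SSF)."""
    ref = bigram_dist(reference_nes)
    selected, training_pairs = [], []
    pool = list(candidate_nes)
    while pool:
        cands = random.sample(pool, min(batch, len(pool)))
        best = min(cands, key=lambda w: distance(ref, selected + [w]))
        if distance(ref, selected + [best]) >= distance(ref, selected):
            break  # distance no longer decreasing: sampled set has SSF
        selected.append(best)
        pool.remove(best)
        training_pairs.append((best, oracle(best)))
    return training_pairs
```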
2.2.1 Transliteration Oracle Implementation
The transliteration oracle is essentially a mapping between the named entities; i.e., given an NE in the source language, it provides the matching NE in the target language. An automatic oracle was implemented by crawling through Wikipedia topic-aligned document pairs. Given a pair of topic-aligned documents in the two languages, the topic can be identified either by identifying the top ranking terms or by simply identifying the title of the documents. By choosing documents in Wikipedia's biography category we ensured that the topic of each document pair is a person NE.
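Given topic-aligned biography pages, the oracle reduces to a title lookup. In this hedged sketch, `aligned_biographies` (an iterable of aligned source/target document pairs) and `get_title` are hypothetical inputs standing in for the Wikipedia crawling machinery, which we omit:

```python
def build_oracle(aligned_biographies, get_title):
    """Map each source-language person NE (the title of the source
    document) to its target-language counterpart (the title of the
    topic-aligned target document)."""
    mapping = {get_title(src_doc): get_title(tgt_doc)
               for src_doc, tgt_doc in aligned_biographies}
    return mapping.get  # oracle: source NE -> target NE (None if unseen)
```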
2.3 Training the transliteration model
The feature extraction scheme we use generates features by coupling substrings from the two terms. Ideally, given a positive sample, the paired substrings should encode phonetic similarity, or a distinctive context in which the two scripts correlate. Given enough positive samples, such features will appear with distinctive frequency. Taking this idea further, these features were recognized by measuring the co-occurrence frequency of substrings of up to two characters in both languages. Each feature $f = (f_s, f_t)$, composed of two substrings taken from English and Hebrew words, was associated with the weight

$$W(f) = \frac{\#(f_s, f_t)}{\#(f_s)} \times \frac{\#(f_s, f_t)}{\#(f_t)},$$

where $\#(f_s, f_t)$ is the number of occurrences of that feature in the positive sample set, and $\#(f_l)$, for $l \in \{s, t\}$, is the number of occurrences of the individual substring in any of the features extracted from positive samples in the training set. The result of this process is a weight table in which, as we empirically tested, the highest-ranking weights were assigned to features that preserve the phonetic correlation between the two languages.

Table 1 (columns: Data Set, Method, Rus, Heb): Results summary. The numbers are the proportion of NEs recognized in the target language. Lines 1 and 2 compare the results of the SSF-directed approach with the baseline system on the first dataset. Line 3 summarizes the results on the second dataset.
To improve the classifier's learning rate, the learning process is focused around these features. Given a sample, the learner is presented with a real-valued feature vector instead of a binary vector, in which each value indicates both that the feature is active and its activation strength, i.e. the weight assigned to it.
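A sketch of the weight table and the resulting real-valued encoding, reusing the hypothetical `extract_features` from section 2.1; the default strength for features unseen in the positive data is our assumption, since the paper leaves that case unspecified:

```python
from collections import Counter

def feature_weights(positive_pairs, extract_features):
    """W(f) = (#(f_s,f_t)/#(f_s)) * (#(f_s,f_t)/#(f_t)), counted over
    the features extracted from the positive training pairs."""
    pair_c, src_c, tgt_c = Counter(), Counter(), Counter()
    for w_s, w_t in positive_pairs:
        for f_s, f_t in extract_features(w_s, w_t):
            pair_c[(f_s, f_t)] += 1
            src_c[f_s] += 1
            tgt_c[f_t] += 1
    return {(f_s, f_t): (c / src_c[f_s]) * (c / tgt_c[f_t])
            for (f_s, f_t), c in pair_c.items()}

def to_real_valued(active_features, weights, default=0.0):
    """Encode a sample as feature -> activation strength instead of a
    binary indicator (the default for unseen features is our assumption)."""
    return {f: weights.get(f, default) for f in active_features}
```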
3 Evaluation
We evaluated our approach in two settings. First, we compared our system to the baseline system described in (Klementiev and Roth, 2006): given a bilingual corpus with the English NEs annotated, the system had to discover the NEs in the target language text. We used the English-Russian news corpus used in the baseline system. NEs were grouped into equivalence classes, each containing different variations of the same NE. We randomly sampled 500 documents from the corpus; transliteration pairs were mapped into 97 equivalence classes, identified by an expert. In a second experiment, different learning parameters, such as selective sampling efficiency and feature weights, were checked. 300 English-Russian and English-Hebrew NE pairs were used; negative samples were generated by coupling every English NE with all other target language NEs. Table 1 presents the key results of these experiments, compared with the baseline system.
Table 2 (columns: Extraction method, Number of samples, Recall top one, Recall top two): Comparison of correctly identified English-Russian transliteration pairs in the news corpus. The model trained using selective sampling outperforms models trained using random sampling, even when trained with twice the data. The top one and top two results columns describe the proportion of correctly identified pairs ranked in the first and top two places, respectively.
3.1 Using SSF-directed sampling

Table 2 describes the effect of directed sampling on the English-Russian news corpus NE discovery task. The results show that models trained using selective sampling can outperform models trained with more than twice the amount of data.
3.2 Training using feature weights

Table 3 describes the effect of training the model with weights. The training set consisted of 150 samples extracted using SSF-directed sampling. Three variations were tested: training without feature weights, using the feature weights as the initial network weights without training, and training with weights. The results clearly show that using weights for training improves the classifier's performance for both Russian and Hebrew. It can also be observed that in many cases the correct pair was ranked in one of the top five places.
Table 3 (columns: Training, Feature weights, Russian Top one, Russian Top five, Hebrew Top one, Hebrew Top five): The proportion of correctly identified transliteration pairs with and without using weights and training. The top one and top five results columns describe the proportion of correctly identified pairs ranked in the first place and in any of the top five places, respectively. The results demonstrate that using feature weights improves performance for both target languages.

4 Conclusions and future work

In this paper we presented a new approach for constructing a transliteration model automatically and efficiently, by selectively extracting transliteration samples covering relevant parts of the feature space and by focusing the learning process on these features. We show that our approach can outperform systems requiring supervision, manual intervention and a considerable amount of data. We propose a new measure for selective sample selection which can be used independently. We are currently investigating applying it in other domains with a potentially larger feature space than the one used in this work. Another aspect under investigation is using our selective sampling to adapt the learning process to data originating from different sources: using a reference set representative of the testing data, training samples originating from a different source can be biased towards the testing data.
5 Acknowledgments
Partly supported by NSF grant ITR IIS-0428472 and DARPA funding under the Bootstrap Learning Program.
References
Y. Freund and R. E. Schapire. 1998. Large margin classification using the perceptron algorithm. In COLT.

A. Grove and D. Roth. 2001. Linear concepts and hidden variables. Machine Learning, 42.

A. Klementiev and D. Roth. 2006. Weakly supervised named entity transliteration and discovery from multilingual comparable corpora. In ACL.

K. Knight and J. Graehl. 1997. Machine transliteration. In EACL.

A. K. McCallum and K. Nigam. 1998. Employing EM in pool-based active learning for text classification. In ICML.

D. Roth. 1998. Learning to resolve natural language ambiguities: A unified approach. In AAAI.

H. S. Seung, M. Opper, and H. Sompolinsky. 1992. Query by committee. In COLT.

T. Tao, S. Yoon, A. Fister, R. Sproat, and C. Zhai. 2006. Unsupervised named entity transliteration using temporal and phonetic correlation. In EMNLP.