Active Sample Selection for Named Entity Transliteration
Dan Goldwasser and Dan Roth
Department of Computer Science
University of Illinois
Urbana, IL 61801
{goldwas1,danr}@uiuc.edu
Abstract

This paper introduces a new method for identifying named-entity (NE) transliterations within bilingual corpora. Current state-of-the-art approaches usually require annotated data and relevant linguistic knowledge, which may not be available for all languages. We show how to effectively train an accurate transliteration classifier using very little data, obtained automatically. To perform this task, we introduce a new active sampling paradigm for guiding and adapting the sample selection process. We also investigate how to improve the classifier by identifying repeated patterns in the training data. We evaluated our approach using English, Russian and Hebrew corpora.
1 Introduction
This paper presents a new approach for constructing a discriminative transliteration model. Our approach is fully automated and requires little knowledge of the source and target languages.

Named entity (NE) transliteration is the process of transcribing a NE from a source language to a target language based on phonetic similarity between the entities. Figure 1 provides examples of NE transliterations in English, Russian and Hebrew.

Figure 1: NE in English, Russian and Hebrew.

Identifying transliteration pairs is an important component in many linguistic applications, such as machine translation and information retrieval, which require identifying out-of-vocabulary words.

In our settings, we have access to source language NEs and the ability to label the data upon request. We introduce a new active sampling paradigm that aims to guide the learner toward informative samples, allowing learning from a small number of representative examples. After the data is obtained, it is analyzed to identify repeating patterns which can be used to focus the training process of the model.

Previous works usually take a generative approach (Knight and Graehl, 1997). Other approaches exploit similarities in aligned bilingual corpora; for example, (Tao et al., 2006) combine two unsupervised methods, and (Klementiev and Roth, 2006) bootstrap with a classifier used interchangeably with an unsupervised temporal alignment method. Although these approaches alleviate the problem of obtaining annotated data, other resources are still required, such as a large aligned bilingual corpus.
The idea of selectively sampling training examples has been widely discussed in machine learning theory (Seung et al., 1992) and has been applied successfully to several NLP applications (McCallum and Nigam, 1998). Unlike other approaches, our approach is based on minimizing the distance between the feature distribution of a comprehensive reference set and that of the sampled set.
2 Training a Transliteration Model
Our framework works in several stages, as summarized in Algorithm 1. First, a training set consisting of NE transliteration pairs $(w_s, w_t)$ is automatically generated using an active sample selection scheme. The sample selection process is guided by the Sufficiently Spanning Features (SSF) criterion introduced in section 2.2, which identifies informative samples in the source language. An oracle capable of pairing a NE in the source language with its counterpart in the target language is then used. Negative training samples are generated by reshuffling the terms in these pairs (a minimal sketch is given after Algorithm 1).

Once the training data has been collected, it is analyzed to identify repeating patterns, which are used to focus the training process by assigning weights to features corresponding to the observed patterns. Finally, a linear model is trained using a variation of the averaged perceptron algorithm (Freund and Schapire, 1998). The remainder of this section provides details about these stages: section 2.1 describes the basic formulation of the transliteration model and the feature extraction scheme, section 2.2 describes the selective sampling process, and section 2.3 explains how learning is focused by using feature weights.
Input: bilingual comparable corpus $(S, T)$; set of named entities $NE_S$ from $S$; reference corpus $R_S$; transliteration oracle $O$; training corpora $D = D_S, D_T$
Output: transliteration model $M$

Guiding the Sampling Process:
1: repeat
2:   select a set $C \subseteq NE_S$ randomly
3:   $w_s = \arg\min_{w \in C} \mathrm{distance}(R, D_S \cup \{w\})$
4:   $D = D \cup \{(w_s, O(w_s))\}$
5: until $\mathrm{distance}(R, D_S \cup \{w_s\}) \ge \mathrm{distance}(R, D_S)$

Determining Feature Activation Strength:
6: define $W: f \to \mathbb{R}$ such that for each feature $f = (f_s, f_t)$,
     $W(f) = \frac{\#(f_s, f_t)}{\#(f_s)} \times \frac{\#(f_s, f_t)}{\#(f_t)}$
7: use $D$ to train $M$

Algorithm 1: Constructing a transliteration model.
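As a concrete illustration of the negative-sample step above, here is a minimal Python sketch. The paper only states that negatives are obtained by reshuffling the terms of the positive pairs, so the exact pairing policy below (re-pairing each source NE with the target side of every other positive pair) is our assumption, not the authors' code.

```python
import itertools

def generate_negatives(positive_pairs):
    """Build negative samples by re-pairing each source-language NE
    with the target side of every *other* positive pair.
    NOTE: this particular reshuffling policy is our assumption; the
    paper only says negatives come from reshuffling the pairs."""
    return [(src, tgt)
            for (src, _), (_, tgt) in itertools.permutations(positive_pairs, 2)]

# Example: two positive pairs yield two mismatched (negative) pairs.
pairs = [("putin", "путин"), ("moscow", "москва")]
print(generate_negatives(pairs))
# [('putin', 'москва'), ('moscow', 'путин')]
```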
2.1 Transliteration Model
Our transliteration model takes a discriminative approach; the classifier is presented with a word pair $(w_s, w_t)$, where $w_s$ is a named entity, and is asked to determine whether $w_t$ is a transliteration of the NE in the target language. We use a linear classifier trained with a regularized perceptron update rule (Grove and Roth, 2001), as implemented in SNoW (Roth, 1998). The classifier's confidence score is used for ranking positively tagged transliteration candidates.

Our initial feature extraction scheme follows the one presented in (Klementiev and Roth, 2006), in which the feature space consists of n-gram pairs from the two languages. Given a sample, each word is decomposed into a set of substrings of up to a given length (including the empty string). Features are generated by pairing substrings from the two sets whose relative positions in the original words differ by one place or less: first each word is decomposed into a set of substrings, then substrings from the two sets are coupled to complete the pair representation. Figure 2 depicts this process.

Figure 2: Feature extraction process.
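The following Python sketch illustrates our reading of this feature scheme: substrings of length up to two (plus the empty string) are indexed by their start position, and source/target substrings are paired when their positions differ by at most one. The exact position-alignment rule and the handling of the empty string are assumptions on our part, not the authors' code.

```python
def substrings(word, max_len=2):
    """All substrings of length 1..max_len with their start position,
    plus the empty string (given position -1 so it pairs with anything)."""
    subs = [("", -1)]
    for i in range(len(word)):
        for n in range(1, max_len + 1):
            if i + n <= len(word):
                subs.append((word[i:i + n], i))
    return subs

def extract_features(w_s, w_t, max_len=2):
    """Couple substrings from the two words whose relative positions
    differ by one place or less; returns a set of (source, target) pairs."""
    return {(s, t)
            for s, i in substrings(w_s, max_len)
            for t, j in substrings(w_t, max_len)
            if i < 0 or j < 0 or abs(i - j) <= 1}
```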
2.2 Guiding the Sampling Process with SSF

The initial step in our framework is to generate a training set of transliteration pairs; this is done by pairing highly informative source language candidate NEs with target language counterparts. We developed a criterion for adding new samples, Sufficiently Spanning Features (SSF), which quantifies the sampled set's ability to span the feature space. This is done by evaluating the L1 distance between the frequency distributions of source language word fragments in the current sampled set and in a comprehensive set of source language NEs, serving as reference. We argue that since the features used for learning are n-gram features, once these two distributions are close enough, our example space provides a good and concise characterization of all named entities we will ever need to consider. Special care should be given to choosing an appropriate reference; as a general guideline, the reference set should be representative of the testing data. We collected a set $R$, consisting
of 50,000 NEs by crawling through Wikipedia's articles and using an English NER system available at http://L2R.cs.uiuc.edu/~cogcomp. The frequency distribution was generated over all character-level bigrams appearing in the text, as bigrams best correlate with the way features are extracted. Given a reference text $R$, the n-gram distribution of $R$ is defined as

$$D_R(ng_i) = \frac{\#ng_i}{\sum_j \#ng_j},$$

where $ng_i$ is an n-gram in $R$ and $\#ng_i$ is its frequency. Given a sample set $S$, we measure the L1 distance between the distributions:

$$\mathrm{distance}(R, S) = \sum_{ng \in R} |D_R(ng) - D_S(ng)|.$$

Samples decreasing the distance between the distributions were added to the training data. Given a set $C$ of candidates for annotation, a sample $w_s \in C$ was added to the training set if

$$w_s = \arg\min_{w \in C} \mathrm{distance}(R, D_S \cup \{w\}).$$

A sample set is said to have SSF if the distance remains constant as more samples are added.
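A minimal sketch of this SSF-guided selection loop, under our own choices of data structures (bigram distributions as dicts, an `oracle` callable as described in section 2.2.1, and a fixed random candidate batch size):

```python
import random
from collections import Counter

def bigram_dist(words):
    """Character-bigram frequency distribution over a list of words."""
    counts = Counter(w[i:i + 2] for w in words for i in range(len(w) - 1))
    total = sum(counts.values())
    return {ng: c / total for ng, c in counts.items()} if total else {}

def distance(ref_dist, sample_words):
    """L1 distance, summed over the reference n-grams as in the text."""
    d = bigram_dist(sample_words)
    return sum(abs(p - d.get(ng, 0.0)) for ng, p in ref_dist.items())

def select_training_pairs(reference_nes, candidate_nes, oracle, batch=20):
    """Greedily pick the candidate that most reduces the distance to the
    reference distribution; stop once no candidate decreases it (SSF)."""
    ref = bigram_dist(reference_nes)
    selected, training_pairs = [], []
    pool = list(candidate_nes)
    while pool:
        cands = random.sample(pool, min(batch, len(pool)))
        best = min(cands, key=lambda w: distance(ref, selected + [w]))
        if distance(ref, selected + [best]) >= distance(ref, selected):
            break  # distance no longer decreasing: sampled set has SSF
        selected.append(best)
        pool.remove(best)
        training_pairs.append((best, oracle(best)))
    return training_pairs
```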
2.2.1 Transliteration Oracle Implementation
The transliteration oracle is essentially a mapping between the named entities; i.e., given an NE in the source language, it provides the matching NE in the target language. An automatic oracle was implemented by crawling through Wikipedia topic-aligned document pairs. Given a pair of topic-aligned documents in the two languages, the topic can be identified either by identifying the top ranking terms or by simply identifying the title of the documents. By choosing documents in Wikipedia's biography category we ensured that the topic of each document pair is a person NE.
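Given topic-aligned biography pages, the oracle reduces to a title lookup. In this hedged sketch, `aligned_biographies` (an iterable of aligned source/target document pairs) and `get_title` are hypothetical inputs standing in for the Wikipedia crawling machinery, which we omit:

```python
def build_oracle(aligned_biographies, get_title):
    """Map each source-language person NE (the title of the source
    document) to its target-language counterpart (the title of the
    topic-aligned target document)."""
    mapping = {get_title(src_doc): get_title(tgt_doc)
               for src_doc, tgt_doc in aligned_biographies}
    return mapping.get  # oracle: source NE -> target NE (None if unseen)
```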
2.3 Training the transliteration model
The feature extraction scheme we use generates features by coupling substrings from the two terms. Ideally, given a positive sample, the paired substrings should encode phonetic similarity, or a distinctive context in which the two scripts correlate. Given enough positive samples, such features will appear with distinctive frequency. Taking this idea further, these features were recognized by measuring the co-occurrence frequency of substrings of up to two characters in both languages. Each feature $f = (f_s, f_t)$, composed of two substrings taken from English and Hebrew words, was associated with the weight

$$W(f) = \frac{\#(f_s, f_t)}{\#(f_s)} \times \frac{\#(f_s, f_t)}{\#(f_t)},$$

where $\#(f_s, f_t)$ is the number of occurrences of that feature in the positive sample set, and $\#(f_l)$, for $l \in \{s, t\}$, is the number of occurrences of the individual substring in any of the features extracted from positive samples in the training set. The result of this process is a weight table in which, as we empirically tested, the highest-ranking weights were assigned to features that preserve the phonetic correlation between the two languages.

Table 1 (columns: Data Set, Method, Rus, Heb): Results summary. The numbers are the proportion of NEs recognized in the target language. Lines 1 and 2 compare the results of the SSF-directed approach with the baseline system on the first dataset. Line 3 summarizes the results on the second dataset.
To improve the classifier's learning rate, the learning process is focused around these features. Given a sample, the learner is presented with a real-valued feature vector instead of a binary vector, in which each value indicates both that the feature is active and its activation strength, i.e. the weight assigned to it.
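A sketch of the weight table and the resulting real-valued encoding, reusing the hypothetical `extract_features` from section 2.1; the default strength for features unseen in the positive data is our assumption, since the paper leaves that case unspecified:

```python
from collections import Counter

def feature_weights(positive_pairs, extract_features):
    """W(f) = (#(f_s,f_t)/#(f_s)) * (#(f_s,f_t)/#(f_t)), counted over
    the features extracted from the positive training pairs."""
    pair_c, src_c, tgt_c = Counter(), Counter(), Counter()
    for w_s, w_t in positive_pairs:
        for f_s, f_t in extract_features(w_s, w_t):
            pair_c[(f_s, f_t)] += 1
            src_c[f_s] += 1
            tgt_c[f_t] += 1
    return {(f_s, f_t): (c / src_c[f_s]) * (c / tgt_c[f_t])
            for (f_s, f_t), c in pair_c.items()}

def to_real_valued(active_features, weights, default=0.0):
    """Encode a sample as feature -> activation strength instead of a
    binary indicator (the default for unseen features is our assumption)."""
    return {f: weights.get(f, default) for f in active_features}
```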
3 Evaluation
We evaluated our approach in two settings. First, we compared our system to the baseline system described in (Klementiev and Roth, 2006): given a bilingual corpus with the English NEs annotated, the system had to discover the NEs in the target language text. We used the English-Russian news corpus used in the baseline system. NEs were grouped into equivalence classes, each containing different variations of the same NE. We randomly sampled 500 documents from the corpus; transliteration pairs were mapped into 97 equivalence classes, identified by an expert. In a second experiment, different learning parameters, such as selective sampling efficiency and feature weights, were checked. 300 English-Russian and English-Hebrew NE pairs were used; negative samples were generated by coupling every English NE with all other target language NEs. Table 1 presents the key results of these experiments, compared with the baseline system.
Table 2 (columns: Extraction method, Number of samples, Recall top one, Recall top two): Comparison of correctly identified English-Russian transliteration pairs in the news corpus. The model trained using selective sampling outperforms models trained using random sampling, even when trained with twice the data. The top one and top two results columns describe the proportion of correctly identified pairs ranked in the first and top two places, respectively.
3.1 Using SSF-directed sampling

Table 2 describes the effect of directed sampling on the English-Russian news corpus NE discovery task. The results show that models trained using selective sampling can outperform models trained with more than twice the amount of data.
3.2 Training using feature weights

Table 3 describes the effect of training the model with weights. The training set consisted of 150 samples extracted using SSF-directed sampling. Three variations were tested: training without feature weights, using the feature weights as the initial network weights without training, and training with weights. The results clearly show that using weights for training improves the classifier's performance for both Russian and Hebrew. It can also be observed that in many cases the correct pair was ranked in one of the top five places.
Table 3 (columns: Training, Feature weights, Russian Top one, Russian Top five, Hebrew Top one, Hebrew Top five): The proportion of correctly identified transliteration pairs with and without using weights and training. The top one and top five results columns describe the proportion of correctly identified pairs ranked in the first place and in any of the top five places, respectively. The results demonstrate that using feature weights improves performance for both target languages.

4 Conclusions and future work

In this paper we presented a new approach for constructing a transliteration model automatically and efficiently, by selectively extracting transliteration samples covering relevant parts of the feature space and by focusing the learning process on these features. We show that our approach can outperform systems requiring supervision, manual intervention and a considerable amount of data. We propose a new measure for selective sample selection which can be used independently. We are currently investigating applying it in other domains with a potentially larger feature space than the one used in this work. Another aspect under investigation is using our selective sampling to adapt the learning process to data originating from different sources: using a reference set representative of the testing data, training samples originating from a different source can be biased towards the testing data.
5 Acknowledgments
Partly supported by NSF grant ITR IIS-0428472 and DARPA funding under the Bootstrap Learning Program.
References
Y. Freund and R. E. Schapire. 1998. Large margin classification using the perceptron algorithm. In COLT.

A. Grove and D. Roth. 2001. Linear concepts and hidden variables. Machine Learning, 42.

A. Klementiev and D. Roth. 2006. Weakly supervised named entity transliteration and discovery from multilingual comparable corpora. In ACL.

K. Knight and J. Graehl. 1997. Machine transliteration. In EACL.

A. K. McCallum and K. Nigam. 1998. Employing EM in pool-based active learning for text classification. In ICML.

D. Roth. 1998. Learning to resolve natural language ambiguities: A unified approach. In AAAI.

H. S. Seung, M. Opper, and H. Sompolinsky. 1992. Query by committee. In COLT.

T. Tao, S. Yoon, A. Fister, R. Sproat, and C. Zhai. 2006. Unsupervised named entity transliteration using temporal and phonetic correlation. In EMNLP.