c Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation Zhongguo Li State Key Laboratory on Intelligent Technology and Systems Tsinghua National Laborator
Trang 1Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 1405–1414,
Portland, Oregon, June 19-24, 2011 c
Parsing the Internal Structure of Words:
A New Paradigm for Chinese Word Segmentation
Zhongguo Li State Key Laboratory on Intelligent Technology and Systems Tsinghua National Laboratory for Information Science and Technology
Department of Computer Science and Technology Tsinghua University, Beijing 100084, China
eemath@gmail.com
Abstract
Lots of Chinese characters are very
produc-tive in that they can form many structured
words either as prefixes or as suffixes
Pre-vious research in Chinese word segmentation
mainly focused on identifying only the word
boundaries without considering the rich
inter-nal structures of many words In this paper we
argue that this is unsatisfying in many ways,
both practically and theoretically Instead, we
propose that word structures should be
recov-ered in morphological analysis An elegant
approach for doing this is given and the result
is shown to be promising enough for
encour-aging further effort in this direction Our
prob-ability model is trained with the Penn Chinese
Treebank and actually is able to parse both
word and phrase structures in a unified way.
Research in Chinese word segmentation has
pro-gressed tremendously in recent years, with state of
the art performing at around 97% in precision and
recall (Xue, 2003; Gao et al., 2005; Zhang and
Clark, 2007; Li and Sun, 2009) However, virtually
all these systems focus exclusively on recognizing
the word boundaries, giving no consideration to the
internal structures of many words Though it has
been the standard practice for many years, we argue
that this paradigm is inadequate both in theory and
in practice, for at least the following four reasons
The first reason is that if we confine our
defi-nition of word segmentation to the identification of
word boundaries, then people tend to have divergent
opinions as to whether a linguistic unit is a word or not (Sproat et al., 1996) This has led to many dif-ferent annotation standards for Chinese word seg-mentation Even worse, this could cause inconsis-tency in the same corpus For instance, 䉂 擌 奒
‘vice president’ is considered to be one word in the Penn Chinese Treebank (Xue et al., 2005), but is split into two words by the Peking University cor-pus in the SIGHAN Bakeoffs (Sproat and Emerson, 2003) Meanwhile, 䉂䀓惼 ‘vice director’ and 䉂
䚲䡮 ‘deputy manager’ are both segmented into two words in the same Penn Chinese Treebank In fact, all these words are composed of the prefix 䉂 ‘vice’ and a root word Thus the structure of 䉂擌奒 ‘vice president’ can be represented with the tree in Fig-ure 1 Without a doubt, there is complete
agree-NN
l ,
JJf
䉂
NNf
擌奒
Figure 1: Example of a word with internal structure.
ment on the correctness of this structure among na-tive Chinese speakers So if instead of annotating only word boundaries, we annotate the structures of every word, 1 then the annotation tends to be more
1 Here it is necessary to add a note on terminology used in this paper Since there is no universally accepted definition
of the “word” concept in linguistics and especially in Chinese, whenever we use the term “word” we might mean a linguistic unit such as 䉂擌奒 ‘vice president’ whose structure is shown
as the tree in Figure 1, or we might mean a smaller unit such as
擌奒 ‘president’ which is a substructure of that tree Hopefully,
1405
Trang 2consistent and there could be less duplication of
ef-forts in developing the expensive annotated corpus
The second reason is applications have different
requirements for granularity of words Take the
per-sonal name 撱嗤吼 ‘Zhou Shuren’ as an example
It’s considered to be one word in the Penn Chinese
Treebank, but is segmented into a surname and a
given name in the Peking University corpus For
some applications such as information extraction,
the former segmentation is adequate, while for
oth-ers like machine translation, the later finer-grained
output is more preferable If the analyzer can
pro-duce a structure as shown in Figure 4(a), then
ev-ery application can extract what it needs from this
tree A solution with tree output like this is more
el-egant than approaches which try to meet the needs
of different applications in post-processing (Gao et
al., 2004)
The third reason is that traditional word
segmen-tation has problems in handling many phenomena
in Chinese For example, the telescopic compound
㦌撥怂惆 ‘universities, middle schools and primary
schools’ is in fact composed of three coordinating
el-ements 㦌惆 ‘university’, 撥惆 ‘middle school’ and
怂惆 ‘primary school’ Regarding it as one flat word
loses this important information Another example
is separable words like 扩 扙 ‘swim’ With a
lin-ear segmentation, the meaning of ‘swimming’ as in
扩堑扙 ‘after swimming’ cannot be properly
rep-resented, since 扩扙 ‘swim’ will be segmented into
discontinuous units These language usages lie at the
boundary between syntax and morphology, and are
not uncommon in Chinese They can be adequately
represented with trees (Figure 2)
H H
JJ
H
H
JJf
㦌
JJf
撥
JJf
怂
NNf
惆
H H
VV
Z
VVf
扩
VVf
堑
NNf
扙
Figure 2: Example of telescopic compound (a) and
sepa-rable word (b).
The last reason why we should care about word
the context will always make it clear what is being referred to
with the term “word”.
structures is related to head driven statistical parsers (Collins, 2003) To illustrate this, note that in the Penn Chinese Treebank, the word 戽䊂䠽吼 ‘En-glish People’ does not occur at all Hence con-stituents headed by such words could cause some difficulty for head driven models in which out-of-vocabulary words need to be treated specially both when they are generated and when they are condi-tioned upon But this word is in turn headed by its suffix 吼 ‘people’, and there are 2,233 such words
in Penn Chinese Treebank If we annotate the struc-ture of every compound containing this suffix (e.g Figure 3), such data sparsity simply goes away
NN
b b
"
"
NRf
戽䊂䠽
NNf
吼
Figure 3: Structure of the out-of-vocabulary word 戽䊂
䠽吼 ‘English People’.
Had there been only a few words with inter-nal structures, current Chinese word segmentation paradigm would be sufficient We could simply re-cover word structures in post-processing But this is far from the truth In Chinese there is a large number
of such words We just name a few classes of these words and give one example for each class (a dot is used to separate roots from affixes):
personal name: 㡿増·揽 ‘Nagao Makoto’
location name: 凝挕·撲 ‘New York State’ noun with a suffix: 䆩䡡·勬 ‘classifier’
noun with a prefix: 敏·䧥䧥 ‘mother-to-be’ verb with a suffix: 敧䃄·䑺 ‘automatize’
verb with a prefix: 䆓·噙 ‘waterproof’
adjective with a suffix: 䉅䏜·怮 ‘composite’ adjective with a prefix: 䆚·搔喪 ‘informal’ pronoun with a prefix: 䊈·墠 ‘everybody’
time expression: 憘䛊䛊壊·兣 ‘the year 1995’ ordinal number: 䀱·喛憘 ‘eleventh’
retroflex suffixation: 䑳䃹·䄎 ‘flower’
This list is not meant to be complete, but we can get
a feel of how extensive the words with non-trivial structures can be With so many productive suf-fixes and presuf-fixes, analyzing word structures in post-processing is difficult, because a character may or may not act as an affix depending on the context 1406
Trang 3For example, the character 吼 ‘people’ in 撇嗤吼
‘the one who plants’ is a suffix, but in the personal
name 撱嗤吼 ‘Zhou Shuren’ it isn’t The structures
of these two words are shown in Figure 4
(a) NR
Z
NFf
撱
NGf
嗤吼
(b) NN
Z
VVf
撇嗤
NNf
吼
Figure 4: Two words that differ only in one character,
but have different internal structures The character 吼
‘people’ is part of a personal name in tree (a), but is a
suffix in (b).
A second reason why generally we cannot
re-cover word structures in post-processing is that some
words have very complex structures For example,
the tree of 壃搕䈿擌懂揶 ‘anarchist’ is shown in
Figure 5 Parsing this structure correctly without a
principled method is difficult and messy, if not
im-possible
NN
a a a
!
!
!
NN
H H
VV
Z Z
VVf
壃
NNf
搕䈿
NNf
擌懂
NNf
揶
Figure 5: An example word which has very complex
structures.
Finally, it must be mentioned that we cannot store
all word structures in a dictionary, as the word
for-mation process is very dynamic and productive in
nature Take 䌬 ‘hall’ as an example Standard
Chi-nese dictionaries usually contain 埣嗖䌬 ‘library’,
but not many other words such as 䎰愒䌬
‘aquar-ium’ generated by this same character This is
un-derstandable since the character 䌬 ‘hall’ is so
pro-ductive that it is impossible for a dictionary to list
every word with this character as a suffix The same
thing happens for natural language processing
sys-tems Thus it is necessary to have a dynamic
mech-anism for parsing word structures
In this paper, we propose a new paradigm for Chinese word segmentation in which not only word boundaries are identified but the internal structures
of words are recovered (Section 3) To achieve this,
we design a joint morphological and syntactic pars-ing model of Chinese (Section 4) Our generative story describes the complete process from sentence and word structures to the surface string of char-acters in a top-down fashion With this probabil-ity model, we give an algorithm to find the parse tree of a raw sentence with the highest probabil-ity (Section 5) The output of our parser incorpo-rates word structures naturally Evaluation shows that the model can learn much of the regularity of word structures, and also achieves reasonable ac-curacy in parsing higher level constituent structures (Section 6)
The necessity of parsing word structures has been noticed by Zhao (2009), who presented a character-level dependency scheme as an alternative to the lin-ear representation of words Although our work is based on the same notion, there are two key dif-ferences The first one is that part-of-speech tags and constituent labels are fundamental for our pars-ing model, while Zhao focused on unlabeled depen-dencies between characters in a word, and part-of-speech information was not utilized Secondly, we distinguish explicitly the generation of flat words such as 䑵喏䃮 ‘Washington’ and words with inter-nal structures Our parsing algorithm also has to be adapted accordingly Such distinction was not made
in Zhao’s parsing model and algorithm
Many researchers have also noticed the awkward-ness and insufficiency of current boundary-only Chi-nese word segmentation paradigm, so they tried to customize the output to meet the requirements of various applications (Wu, 2003; Gao et al., 2004)
In a related research, Jiang et al (2009) presented a strategy to transfer annotated corpora between dif-ferent segmentation standards in the hope of saving some expensive human labor We believe the best solution to the problem of divergent standards and requirements is to annotate and analyze word tures Then applications can make use of these struc-tures according to their own convenience
1407
Trang 4Since the distinction between morphology and
syntax in Chinese is somewhat blurred, our model
for word structure parsing is integrated with
con-stituent parsing There has been many efforts to
in-tegrate Chinese word segmentation, part-of-speech
tagging and parsing (Wu and Zixin, 1998; Zhou and
Su, 2003; Luo, 2003; Fung et al., 2004) However,
in these research all words were considered to be
flat, and thus word structures were not parsed This
is a crucial difference with our work Specifically,
consider the word 碾 碜扨 ‘olive oil’ Our parser
output tree Figure 6(a), while Luo (2003) output tree
(b), giving no hint to the structure of this word since
the result is the same with a real flat word 䧢哫膝
‘Los Angeles’(c)
(a) NN
Z
NNf
碾碜
NNf
扨
(b) NN
NNf
碾碜扨
(c) NR
NRf
䧢哫膝
Figure 6: Difference between our output (a) of parsing
the word 碾碜扨 ‘olive oil’ and the output (b) of Luo
(2003) In (c) we have a true flat word, namely the
loca-tion name 䧢哫膝 ‘Los Angeles’.
The benefits of joint modeling has been noticed
by many For example, Li et al (2010) reported that
a joint syntactic and semantic model improved the
accuracy of both tasks, while Ng and Low (2004)
showed it’s beneficial to integrate word
segmenta-tion and part-of-speech tagging into one model The
later result is confirmed by many others (Zhang and
Clark, 2008; Jiang et al., 2008; Kruengkrai et al.,
2009) Goldberg and Tsarfaty (2008) showed that
a single model for morphological segmentation and
syntactic parsing of Hebrew yielded an error
reduc-tion of 12% over the best pipelined models This is
because an integrated approach can effectively take
into account more information from different levels
of analysis
Parsing of Chinese word structures can be
re-duced to the usual constituent parsing, for which
there has been great progress in the past several
years Our generative model for unified word and
phrase structure parsing is a direct adaptation of the
model presented by Collins (2003) Many other
ap-proaches of constituent parsing also use this kind
of head-driven generative models (Charniak, 1997; Bikel and Chiang, 2000)
Given a raw Chinese sentence like 䤕 撓 䏓 喴 敯
䋳 㢧 喓, a traditional word segmentation system would output some result like 䤕撓䏓 喴 敯䋳㢧 喓(‘Lin Zhihao’, ‘is’, ‘chief engineer’) In our new paradigm, the output should at least be a linear se-quence of trees representing the structures of each word as in Figure 7
NR
Q Q
NFf
䤕
NGf
撓䏓
VV
VVf
喴
NN
H H
JJ
JJf
敯
NN
Z Z
NNf
䋳㢧
NNf
喓
Figure 7: Proposed output for the new Chinese word seg-mentation paradigm.
Note that in the proposed output, all words are an-notated with their part-of-speech tags This is nec-essary since part-of-speech plays an important role
in the generation of compound words For example,
揶 ‘person’ usually combines with a verb to form a compound noun such as 唗䕏揶 ‘designer’
In this paper, we will actually design an integrated morphological and syntactical parser trained with
a treebank Therefore, the real output of our sys-tem looks like Figure 8 It’s clear that besides all
S
P P P P
NP
NR
Z
NFf
䤕
NGf
撓䏓
VP
a a a
!
!
!
VV
VVf
喴
NN
H H
JJ
JJf
敯
NN
Z
NNf
䋳㢧
NNf
喓
Figure 8: The actual output of our parser trained with a fully annotated treebank.
the information of the proposed output for the new 1408
Trang 5paradigm, our model’s output also includes
higher-level syntactic parsing results
3.1 Training Data
We employ a statistical model to parse phrase and
word structures as illustrated in Figure 8 The
cur-rently available treebank for us is the Penn Chinese
Treebank (CTB) 5.0 (Xue et al., 2005) Because our
model belongs to the family of head-driven
statisti-cal parsing models (Collins, 2003), we use the
head-finding rules described by Sun and Jurafsky (2004)
Unfortunately, this treebank or any other
tree-banks for that matter, does not contain annotations
of word structures Therefore, we must annotate
these structures by ourselves The good news is that
the annotation is not too complicated First, we
ex-tract all words in the treebank and check each of
them manually Words with non-trivial structures
are thus annotated Finally, we install these small
trees of words into the original treebank Whether a
word has structures or not is mostly context
indepen-dent, so we only have to annotate each word once
There are two noteworthy issues in this process
Firstly, as we’ll see in Section 4, flat words and
non-flat words will be modeled differently, thus it’s
important to adapt the part-of-speech tags to
facili-tate this modeling strategy For example, the tag for
nouns is NN as in 憞䠮䞎 ‘Iraq’ and 卣敯埚
‘for-mer president’ After annotation, the for‘for-mer is flat,
but the later has a structure (Figure 9) So we change
the POS tag for flat nouns to NNf, then during
bot-tom up parsing, whenever a new constituent ending
with ‘f’ is found, we can assign it a probability in a
way different from a structured word or phrase
Secondly, we should record the head position of
each word tree in accordance with the requirements
of head driven parsing models As an example, the
right tree in Figure 9 has the context free rule “NN
→ JJf NNf”, the head of which should be the
right-most NNf Therefore, in 卣敯埚 ‘former president’
the head is 敯埚 ‘president’
In passing, the readers should note the fact that
in Figure 9, we have to add a parent labeled NN to
the flat word 憞䠮䞎 ‘Iraq’ so as not to change the
context-free rules contained inherently in the
origi-nal treebank
(a) NN
NNf
憞䠮䞎
(b) NN
l ,
JJf
卣
NNf
敯埚
Figure 9: Example word structure annotation We add an
‘f’ to the POS tags of words with no further structures.
Given an observed raw sentences S, our generative model tells a story about how this surface sequence
of Chinese characters is generated with a linguisti-cally plausible morphological and syntactical pro-cess, thereby defining a joint probability Pr(T, S) where T is a parse tree carrying word structures as well as phrase structures With this model, the pars-ing problem is to search for the tree T∗such that
T∗ = arg max
T
Pr(T, S) (1)
The generation of S is defined in a top down fash-ion, which can be roughly summarized as follows First, the lexicalized constituent structures are gen-erated, then the lexicalized structure of each word
is generated Finally, flat words with no structures are generated As soon as this is done, we get a tree whose leaves are Chinese characters and can be con-catenated to get the surface character sequence S
4.1 Generation of Constituent Structures Each node in the constituent tree corresponds to a lexicalized context free rule
P → LnLn−1· · · L1HR1R2· · · Rm (2)
where P , Li, Riand H are lexicalized nonterminals and H is the head To generate this constituent, first
P is generated, then the head child H is generated conditioned on P , and finally each Li and Rj are generated conditioned on P and H and a distance metric This breakdown of lexicalized PCFG rules
is essentially the Model 2 defined by Collins (1999)
We refer the readers to Collins’ thesis for further de-tails
1409
Trang 64.2 Generation of Words with Internal
Structures
Words with rich internal structures can be described
using a context-free grammar formalism as
word → word suffix (4)
word → prefix word (5)
Here the root is any word without interesting internal
structures, and the prefixes and suffixes are not
lim-ited to single characters For example, 擌懂 ‘ism’ as
in 她㦓擌懂 ‘modernism’ is a well known and very
productive suffix Also, we can see that rules (4) and
(5) are recursive and hence can handle words with
very complex structures
By (3)–(5), the generation of word structures is
exactly the same as that of ordinary phrase
struc-tures Hence the probabilities of these words can be
defined in the same way as higher level constituents
in (2) Note that in our case, each word with
struc-tures is naturally lexicalized, since in the annotation
process we have been careful to record the head
po-sition of each complex word
As an example, consider a word w = R(r) S(s)
where R is the root part-of-speech headed by the
word r, and S is the suffix part-of-speech headed
by s If the head of this word is its suffix, then we
can define the probability of w by
Pr(w) = Pr(S, s) · Pr(R, r|S, s) (6)
This is equivalent to saying that to generate w, we
first generate its head S(s), then conditioned on this
head, other components of this word are generated
In actual parsing, because a word always occurs in
some contexts, the above probability should also be
conditioned on these contexts, such as its parent and
the parent’s head word
4.3 Generation of Flat Words
We say a word is flat if it contains only one
mor-pheme such as 憞䠮䞎 ‘Iraq’, or if it is a compound
like 䝭䅵 ‘develop’ which does not have a
produc-tive component we are currently interested in
De-pending on whether a flat word is known or not,
their generative probabilities are computed also
dif-ferently Generation of flat words seen in training is
trivial and deterministic since every phrase and word structure rules are lexicalized
However, the generation of unknown flat words
is a different story During training, words that oc-cur less than 6 times are substituted with the symbol
UNKNOWN In testing, unknown words are gener-ated after the generation of symbolUNKNOWN, and
we define their probability by a first-order Markov model That is, given a flat word w = c1c2· · · cn not seen in training, we define its probability condi-tioned with the part-of-speech p as
Pr(w|p) =
n+1
Y
i=1
Pr(ci|ci−1, p) (7)
where c0 is taken to be a STARTsymbol indicating the left boundary of a word and cn+1 is the STOP
symbol to indicate the right boundary Note that the generation of w is only conditioned on its part-of-speech p, ignoring the larger constituent or word in which w occurs
We use a back-off strategy to smooth the proba-bilities in (7):
˜ Pr(ci|ci−1, p) = λ1· ˆPr(ci|ci−1, p)
+ λ2· ˆPr(ci|ci−1) +λ3· ˆPr(ci) (8)
where λ1 + λ2 + λ3 = 1 to ensure the conditional probability is well formed These λs will be esti-mated with held-out data The probabilities on the right side of (8) can be estimated with simple counts:
ˆ Pr(ci|ci−1, p) = COUNT(ci−1ci, p)
COUNT(ci−1, p) (9) The other probabilities can be estimated in the same way
4.4 Summary of the Generative Story
We make a brief summary of our generative story for the integrated morphological and syntactic parsing model For a sentence S and its parse tree T , if we denote the set of lexicalized phrase structures in T
by C, the set of lexicalized word structures by W, and the set of unknown flat words by F , then the joint probability Pr(T, S) according to our model is
Pr(T, S) =Y
c∈C
Pr(c) Y
w∈W
Pr(w) Y
f ∈F
Pr(f ) (10)
1410
Trang 7In practice, the logarithm of this probability can be
calculated instead to avoid numerical difficulties
5 The Parsing Algorithm
To find the parse tree with highest probability we
use a chart parser adapted from Collins (1999) Two
key changes must be made to the search process,
though Firstly, because we are proposing a new
paradigm for Chinese word segmentation, the input
to the parser must be raw sentences by definition
Hence to use the bottom-up parser, we need a
lex-icon of all characters together with what roles they
can play in a flat word We can get this lexicon from
the treebank For example, from the word 撥愊/NNf
‘center’, we can extract a role bNNf for character 撥
‘middle’ and a role eNNf for character 愊 ‘center’
The role bNNf means the beginning of the flat
la-bel NNf, while eNNf stands for the end of the lala-bel
NNf This scheme was first proposed by Luo (2003)
in his character-based Chinese parser, and we find it
quite adequate for our purpose here
Secondly, in the bottom-up parser for head driven
models, whenever a new edge is found, we must
as-sign it a probability and a head word If the newly
discovered constituent is a flat word (its label ends
with ‘f’), then we set its head word to be the
con-catenation of all its child characters, i.e the word
itself If it is an unknown word, we use (7) to assign
the probability, otherwise its probability is set to be
1 On the other hand, if the new edge is a phrase or
word with internal structures, the probability is set
according to (2), while the head word is found with
the appropriate head rules In this bottom-up way,
the probability for a complete parse tree is known
as soon as it is completed This probability includes
both word generation probabilities and constituent
probabilities
For several reasons, it is a little tricky to evaluate the
accuracy of our model for integrated morphological
and syntactic parsing First and foremost, we
cur-rently know of no other same effort in parsing the
structures of Chinese words, and we have to
anno-tate word structures by ourselves Hence there is no
baseline performance to compare with Secondly,
simply reporting the accuracy of labeled precision
and recall is not very informative because our parser takes raw sentences as input, and its output includes
a lot of easy cases like word segmentation and part-of-speech tagging results
Despite these difficulties, we note that higher-level constituent parsing results are still somewhat comparable with previous performance in parsing Penn Chinese Treebank, because constituent parsing does not involve word structures directly Having said that, it must be pointed out that the comparison
is meaningful only in a limited sense, as in previous literatures on Chinese parsing, the input is always word segmented or even part-of-speech tagged That
is, the bracketing in our case is around characters instead of words Another observation is we can still evaluate Chinese word segmentation and part-of-speech tagging accuracy, by reading off the cor-responding result from parse trees Again because
we split the words with internal structures into their components, comparison with other systems should
be viewed with that in mind
Based on these discussions, we divide the labels
of all constituents into three categories:
Phrase labels are the labels in Peen Chinese Tree-bank for nonterminal phrase structures, includ-ing NP, VP, PP, etc
POS labels represent part-of-speech tags such as
NN, VV, DEG, etc
Flat labels are generated in our annotation for words with no interesting structures Recall that they always end with an ‘f’ such as NNf, VVf and DEGf, etc
With this classification, we report our parser’s ac-curacy for phrase labels, which is approximately the accuracy of constituent parsing of Penn Chinese Treebank We report our parser’s word segmenta-tion accuracy based on the flat labels This accu-racy is in fact the joint accuaccu-racy of segmentation and part-of-speech tagging Most importantly, we can report our parser’s accuracy in recovering word structures based on POS labels and flat labels, since word structures may contain only these two kinds of labels
With the standard split of CTB 5.0 data into train-ing, development and test sets (Zhang and Clark, 1411
Trang 82009), the result are summarized in Table 1 For all
label categories, the PARSEEVAL measures (Abney
et al., 1991) are used in computing the labeled
pre-cision and recall
Types LP LR F1
Phrase 79.3 80.1 79.7
Flat 93.2 93.8 93.5
Flat* 97.1 97.6 97.3
POS & Flat 92.7 93.2 92.9
Table 1: Labeled precision and recall for the three types
of labels The line labeled ‘Flat*’ is for unlabeled
met-rics of flat words, which is effectively the ordinary word
segmentation accuracy.
Though not directly comparable, we can make
some remarks to the accuracy of our model For
constituent parsing, the best result on CTB 5.0 is
reported to be 78% F1 measure for unlimited
sen-tences with automatically assigned POS tags (Zhang
and Clark, 2009) Our result for phrase labels is
close to this accuracy Besides, the result for flat
labels compares favorably with the state of the art
accuracy of about 93% F1 for joint word
segmen-tation and part-of-speech tagging (Jiang et al., 2008;
Kruengkrai et al., 2009) For ordinary word
segmen-tation, the best result is reported to be around 97%
F1on CTB 5.0 (Kruengkrai et al., 2009), while our
parser performs at 97.3%, though we should
remem-ber that the result concerns flat words only Finally,
we see the performance of word structure recovery
is almost as good as the recognition of flat words
This means that parsing word structures accurately
is possible with a generative model
It is interesting to see how well the parser does
in recognizing the structure of words that were not
seen during training For this, we sampled 100
such words including those with prefixes or suffixes
and personal names We found that for 82 of these
words, our parser can correctly recognize their
struc-tures This means our model has learnt something
that generalizes well to unseen words
In error analysis, we found that the parser tends
to over generalize for prefix and suffix characters
For example, 㦌斊䕛 ‘great writer’ is a noun phrase
consisting of an adjective 㦌 ‘great’ and a noun 斊䕛
‘writer’, as shown in Figure 10(a), but our parser
in-correctly analyzed it into a root 㦌斊 ‘masterpiece’ and a suffix 䕛 ‘expert’, as in Figure 10(b) This
(a) NP
l ,
JJ
JJf
㦌
NN
NNf
斊䕛
(b) NN
Z Z
NNf
㦌斊
NNf
䕛
Figure 10: Example of parser error Tree (a) is correct, and (b) is the wrong result by our parser.
is because the character 䕛 ‘expert’ is a very pro-ductive suffix, as in 䑺惆䕛 ‘chemist’ and 堉䘂䕛
‘diplomat’ This observation is illuminating because most errors of our parser follow this pattern Cur-rently we don’t have any non-ad hoc way of prevent-ing such kind of over generalization
7 Conclusion and Discussion
In this paper we proposed a new paradigm for Chi-nese word segmentation in which not only flat words were identified but words with structures were also parsed We gave good reasons why this should be done, and we presented an effective method show-ing how this could be done With the progress in statistical parsing technology and the development
of large scale treebanks, the time has now come for this paradigm shift to happen We believe such a new paradigm for word segmentation is linguisti-cally justified and pragmatilinguisti-cally beneficial to real world applications We showed that word struc-tures can be recovered with high precision, though there’s still much room for improvement, especially for higher level constituent parsing
Our model is generative, but discriminative mod-els such as maximum entropy technique (Berger
et al., 1996) can be used in parsing word struc-tures too Many parsers using these techniques have been proved to be quite successful (Luo, 2003; Fung et al., 2004; Wang et al., 2006) Another possible direction is to combine generative models with discriminative reranking to enhance the accu-racy (Collins and Koo, 2005; Charniak and Johnson, 2005)
Finally, we must note that the use of flat labels such as “NNf” is less than ideal The most impor-1412
Trang 9tant reason these labels are used is we want to
com-pare the performance of our parser with previous
re-sults in constituent parsing, part-of-speech tagging
and word segmentation, as we did in Section 6 The
problem with this approach is that word structures
and phrase structures are then not treated in a truly
unified way, and besides the 33 part-of-speech tags
originally contained in Penn Chinese Treebank,
an-other 33 tags ending with ‘f’ are introduced We
leave this problem open for now and plan to address
it in future work
Acknowledgments
I would like to thank Professor Maosong Sun for
many helpful discussions on topics of Chinese
mor-phological and syntactic analysis The author is
sup-ported by NSFC under Grant No 60873174
Heart-felt thanks also go to the reviewers for many
per-tinent comments which have greatly improved the
presentation of this paper
References
S Abney, S Flickenger, C Gdaniec, C Grishman,
P Harrison, D Hindle, R Ingria, F Jelinek, J
Kla-vans, M Liberman, M Marcus, S Roukos, B
San-torini, and T Strzalkowski 1991 Procedure for
quan-titatively comparing the syntactic coverage of English
grammars In E Black, editor, Proceedings of the
workshop on Speech and Natural Language, HLT ’91,
pages 306–311, Morristown, NJ, USA Association for
Computational Linguistics.
Adam L Berger, Vincent J Della Pietra, and Stephen A.
Della Pietra 1996 A maximum entropy approach to
natural language processing Computational
Linguis-tics, 22(1):39–71.
Daniel M Bikel and David Chiang 2000 Two
statis-tical parsing models applied to the Chinese treebank.
In Second Chinese Language Processing Workshop,
pages 1–6, Hong Kong, China, October Association
for Computational Linguistics.
Eugene Charniak and Mark Johnson 2005
Coarse-to-fine n-best parsing and maxent discriminative
rerank-ing In Proceedings of the 43rd Annual Meeting on
Association for Computational Linguistics, ACL ’05,
pages 173–180, Morristown, NJ, USA Association for
Computational Linguistics.
Eugene Charniak 1997 Statistical parsing with a
context-free grammar and word statistics In
Proceed-ings of the fourteenth national conference on artificial
intelligence and ninth conference on Innovative ap-plications of artificial intelligence, AAAI’97/IAAI’97, pages 598–603 AAAI Press.
Michael Collins and Terry Koo 2005 Discrimina-tive reranking for natural language parsing Compu-tational Linguistics, 31:25–70, March.
Michael Collins 1999 Head-Driven Statistical Models for Natural Language Parsing Ph.D thesis, Univer-sity of Pennsylvania.
Michael Collins 2003 Head-driven statistical models for natural language parsing Computational Linguis-tics, 29(4):589–637.
Pascale Fung, Grace Ngai, Yongsheng Yang, and Ben-feng Chen 2004 A maximum-entropy Chinese parser augmented by transformation-based learning ACM Transactions on Asian Language Information Processing, 3:159–168, June.
Jianfeng Gao, Andi Wu, Cheng-Ning Huang, Hong qiao
Li, Xinsong Xia, and Hauwei Qin 2004 Adaptive Chinese word segmentation In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), Main Volume, pages 462–469, Barcelona, Spain, July.
Jianfeng Gao, Mu Li, Andi Wu, and Chang-Ning Huang.
2005 Chinese word segmentation and named entity recognition: A pragmatic approach Computational Linguistics, 31(4):531–574.
Yoav Goldberg and Reut Tsarfaty 2008 A single gener-ative model for joint morphological segmentation and syntactic parsing In Proceedings of ACL-08: HLT, pages 371–379, Columbus, Ohio, June Association for Computational Linguistics.
Wenbin Jiang, Liang Huang, Qun Liu, and Yajuan L¨u.
2008 A cascaded linear model for joint Chinese word segmentation and part-of-speech tagging In Proceed-ings of ACL-08: HLT, pages 897–904, Columbus, Ohio, June Association for Computational Linguis-tics.
Wenbin Jiang, Liang Huang, and Qun Liu 2009 Au-tomatic adaptation of annotation standards: Chinese word segmentation and POS tagging – a case study In Proceedings of the Joint Conference of the 47th An-nual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 522–530, Suntec, Singapore, Au-gust Association for Computational Linguistics Canasai Kruengkrai, Kiyotaka Uchimoto, Jun’ichi Kazama, Yiou Wang, Kentaro Torisawa, and Hitoshi Isahara 2009 An error-driven word-character hybrid model for joint Chinese word segmentation and POS tagging In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th Interna-tional Joint Conference on Natural Language
Process-1413
Trang 10ing of the AFNLP, pages 513–521, Suntec, Singapore,
August Association for Computational Linguistics.
Zhongguo Li and Maosong Sun 2009 Punctuation as
implicit annotations for Chinese word segmentation.
Computational Linguistics, 35:505–512, December.
Junhui Li, Guodong Zhou, and Hwee Tou Ng 2010.
Joint syntactic and semantic parsing of Chinese In
Proceedings of the 48th Annual Meeting of the
As-sociation for Computational Linguistics, pages 1108–
1117, Uppsala, Sweden, July Association for
Compu-tational Linguistics.
Xiaoqiang Luo 2003 A maximum entropy Chinese
character-based parser In Michael Collins and Mark
Steedman, editors, Proceedings of the 2003
Confer-ence on Empirical Methods in Natural Language
Pro-cessing, pages 192–199.
Hwee Tou Ng and Jin Kiat Low 2004 Chinese
part-of-speech tagging: One-at-a-time or all-at-once?
word-based or character-word-based? In Dekang Lin and Dekai
Wu, editors, Proceedings of EMNLP 2004, pages 277–
284, Barcelona, Spain, July Association for
Computa-tional Linguistics.
Richard Sproat and Thomas Emerson 2003 The first
international Chinese word segmentation bakeoff In
Proceedings of the Second SIGHAN Workshop on
Chi-nese Language Processing, pages 133–143, Sapporo,
Japan, July Association for Computational
Linguis-tics.
Richard Sproat, William Gale, Chilin Shih, and Nancy
Chang 1996 A stochastic finite-state
word-segmentation algorithm for Chinese Computational
Linguistics, 22(3):377–404.
Honglin Sun and Daniel Jurafsky 2004 Shallow
se-mantc parsing of Chinese In Daniel Marcu
Su-san Dumais and Salim Roukos, editors, HLT-NAACL
2004: Main Proceedings, pages 249–256, Boston,
Massachusetts, USA, May 2 - May 7 Association for
Computational Linguistics.
Mengqiu Wang, Kenji Sagae, and Teruko Mitamura.
2006 A fast, accurate deterministic parser for chinese.
In Proceedings of the 21st International Conference
on Computational Linguistics and 44th Annual
Meet-ing of the Association for Computational LMeet-inguistics,
pages 425–432, Sydney, Australia, July Association
for Computational Linguistics.
Andi Wu and Jiang Zixin 1998 Word segmentation in
sentence analysis In Proceedings of the 1998
Interna-tional Conference on Chinese information processing,
Beijing, China.
Andi Wu 2003 Customizable segmentation of
morpho-logically derived words in Chinese Computational
Linguistics and Chinese language processing, 8(1):1–
28.
Nianwen Xue, Fei Xia, Fu-Dong Chiou, and Martha Palmer 2005 The Penn Chinese Treebank: phrase structure annotation of a large corpus Natural Lan-guage Engineering, 11(2):207–238.
Nianwen Xue 2003 Chinese word segmentation as character tagging Computational Linguistics and Chinese Language Processing, 8(1):29–48.
Yue Zhang and Stephen Clark 2007 Chinese segmenta-tion with a word-based perceptron algorithm In Pro-ceedings of the 45th Annual Meeting of the Association
of Computational Linguistics, pages 840–847, Prague, Czech Republic, June Association for Computational Linguistics.
Yue Zhang and Stephen Clark 2008 Joint word segmen-tation and POS tagging using a single perceptron In Proceedings of ACL-08: HLT, pages 888–896, Colum-bus, Ohio, June Association for Computational Lin-guistics.
Yue Zhang and Stephen Clark 2009 Transition-based parsing of the Chinese treebank using a global dis-criminative model In Proceedings of the 11th Inter-national Conference on Parsing Technologies, IWPT
’09, pages 162–171, Morristown, NJ, USA Associa-tion for ComputaAssocia-tional Linguistics.
Hai Zhao 2009 Character-level dependencies in Chi-nese: Usefulness and learning In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 879–887, Athens, Greece, March Association for Computational Linguistics.
Guodong Zhou and Jian Su 2003 A Chinese effi-cient analyser integrating word segmentation, part-of-speech tagging, partial parsing and full parsing In Proceedings of the Second SIGHAN Workshop on Chi-nese Language Processing, pages 78–83, Sapporo, Japan, July Association for Computational Linguis-tics.
1414