Coreference Resolution with World Knowledge
Altaf Rahman and Vincent Ng
Human Language Technology Research Institute
University of Texas at Dallas
Richardson, TX 75083-0688
{altaf,vince}@hlt.utdallas.edu
Abstract
While world knowledge has been shown to improve learning-based coreference resolvers, the improvements were typically obtained by incorporating world knowledge into a fairly weak baseline resolver. Hence, it is not clear whether these benefits can carry over to a stronger baseline. Moreover, since there has been no attempt to apply different sources of world knowledge in combination to coreference resolution, it is not clear whether they offer complementary benefits to a resolver. We systematically compare commonly-used and under-investigated sources of world knowledge for coreference resolution by applying them to two learning-based coreference models and evaluating them on documents annotated with two different annotation schemes.
1 Introduction
Noun phrase (NP) coreference resolution is the task of determining which NPs in a text or dialogue refer to the same real-world entity. The difficulty of the task stems in part from its reliance on world knowledge (Charniak, 1972). To exemplify, consider the following text fragment:
Martha Stewart is hoping people don’t run out on her.
The celebrity indicted on charges stemming from ...
Having the (world) knowledge that Martha Stewart is a celebrity would be helpful for establishing the coreference relation between the two NPs. One may argue that employing heuristics such as subject preference or syntactic parallelism (which prefers resolving an NP to a candidate antecedent that has the same grammatical role) in this example would also allow us to correctly resolve the celebrity (Mitkov, 2002), thereby obviating the need for world knowledge. However, since these heuristics are not perfect, complementing them with world knowledge would be an important step towards bringing coreference systems to the next level of performance.

Despite the usefulness of world knowledge for coreference resolution, early learning-based coreference resolvers have relied mostly on morpho-syntactic features (e.g., Soon et al. (2001), Ng and Cardie (2002), Yang et al. (2003)). With recent advances in lexical semantics research and the development of large-scale knowledge bases, researchers have begun to employ world knowledge for coreference resolution. World knowledge is extracted primarily from three data sources: web-based encyclopedias (e.g., Ponzetto and Strube (2006), Uryupina et al. (2011)), unannotated data (e.g., Daumé III and Marcu (2005), Ng (2007)), and coreference-annotated data (e.g., Bengtson and Roth (2008)). While each of these three sources of world knowledge has been shown to improve coreference resolution, the improvements were typically obtained by incorporating world knowledge (as features) into a baseline resolver composed of a rather weak coreference model (i.e., the mention-pair model) and a small set of features (i.e., the 12 features adopted by Soon et al.'s (2001) knowledge-lean approach).
As a result, some questions naturally arise. First, can world knowledge still offer benefits when used in combination with a richer set of features? Second, since automatically extracted world knowledge is typically noisy (Ponzetto and Poesio, 2009), are recently-developed coreference models more noise-tolerant than the mention-pair model, and if so, can they profit more from the noisily extracted world knowledge? Finally, while different world knowledge sources have been shown to be useful when applied in isolation to a coreference system, do they offer complementary benefits and therefore can further improve a resolver when applied in combination?
We seek answers to these questions by conducting a systematic evaluation of different world knowledge sources for learning-based coreference resolution. Specifically, we (1) derive world knowledge from encyclopedic sources that are under-investigated for coreference resolution, including FrameNet (Baker et al., 1998) and YAGO (Suchanek et al., 2007), in addition to coreference-annotated data and unannotated data; (2) incorporate such knowledge as features into a richer baseline feature set that we previously employed (Rahman and Ng, 2009); and (3) evaluate their utility using two coreference models, the traditional mention-pair model (Soon et al., 2001) and the recently developed cluster-ranking model (Rahman and Ng, 2009).

Our evaluation corpus contains 410 documents, which are coreference-annotated using the ACE annotation scheme as well as the OntoNotes annotation scheme (Hovy et al., 2006). By evaluating on two sets of coreference annotations for the same set of documents, we can determine whether the usefulness of world knowledge sources for coreference resolution is dependent on the underlying annotation scheme used to annotate the documents.
2 Preliminaries
In this section, we describe the corpus, the NP extraction methods, the coreference models, and the evaluation measures we will use in our evaluation.
2.1 Data Set
We evaluate on documents that are coreference-annotated using both the ACE annotation scheme and the OntoNotes annotation scheme, so that we can examine whether the usefulness of our world knowledge sources is dependent on the underlying coreference annotation scheme. Specifically, our data set is composed of the 410 English newswire articles that appear in both OntoNotes-2 and ACE 2004/2005. We partition the documents into a training set and a test set following an 80/20 ratio.

ACE and OntoNotes employ different guidelines to annotate coreference chains. A major difference between the two annotation schemes is that ACE is only concerned with establishing coreference chains among NPs that belong to the ACE entity types, whereas OntoNotes does not have this restriction. Hence, the OntoNotes annotation scheme should produce more coreference chains (i.e., non-singleton coreference clusters) than the ACE annotation scheme for a given set of documents. For our data set, the OntoNotes scheme yielded 4500 chains, whereas the ACE scheme yielded only 3637 chains. Another difference between the two annotation schemes is that singleton clusters are annotated in ACE but not OntoNotes. As discussed below, the presence of singleton clusters may have an impact on NP extraction and coreference evaluation.
2.2 NP Extraction
Following common practice, we employ different methods to extract NPs from the documents annotated with the two annotation schemes.

To extract NPs from the ACE-annotated documents, we train a mention extractor on the training texts (see Section 5.1 of Rahman and Ng (2009) for details), which recalls 83.6% of the NPs in the test set. On the other hand, the same method should not be applied to extract NPs from the OntoNotes-annotated documents. To see the reason, recall that only the NPs in non-singleton clusters are annotated in these documents. Training a mention extractor on these NPs implies that we are learning to extract non-singleton NPs, which are typically much smaller in number than the entire set of NPs. In other words, doing so could substantially simplify the coreference task. Consequently, we follow the approach adopted by traditional learning-based resolvers and employ an NP chunker to extract NPs. Specifically, we use the markable identification system in the Reconcile resolver (Stoyanov et al., 2010) to extract NPs from the training and test texts. This identifier recalls 77.4% of the NPs in the test set.
2.3 Coreference Models
We evaluate the utility of world knowledge using the mention-pair model and the cluster-ranking model.
2.3.1 Mention-Pair Model
The mention-pair (MP) model is a classifier that determines whether two NPs are coreferent or not. Each instance i(NPj, NPk) corresponds to NPj and NPk, and is represented by a Baseline feature set consisting of 39 features. Linguistically, these features can be divided into four categories: string-matching, grammatical, semantic, and positional. These features can also be categorized based on whether they are relational or not. Relational features capture the relationship between NPj and NPk, whereas non-relational features capture the linguistic property of one of these two NPs. Since space limitations preclude a description of these features, we refer the reader to Rahman and Ng (2009) for details.
We follow Soon et al.'s (2001) method for creating training instances: we create (1) a positive instance for each anaphoric NP, NPk, and its closest antecedent, NPj; and (2) a negative instance for NPk paired with each of the intervening NPs, NPj+1, NPj+2, ..., NPk−1. The classification of a training instance is either positive or negative, depending on whether the two NPs are coreferent in the associated text. To train the MP model, we use the SVM learning algorithm from SVMlight (Joachims, 2002).[1]

[1] For this and subsequent uses of the SVM learner in our experiments, we set all parameters to their default values.
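To make the instance-creation scheme concrete, the following minimal Python sketch reproduces it under simplifying assumptions; the NP objects, the chain_of mapping from an NP to its gold cluster, and the featurize helper (standing in for the 39 Baseline features) are hypothetical names, not part of the actual implementation.

def create_mp_training_instances(nps, chain_of, featurize):
    # nps: the NPs of one document in textual order
    # chain_of: maps an NP to its gold (non-singleton) cluster id; absent otherwise
    # featurize: computes the feature vector for an NP pair (hypothetical helper)
    instances = []
    for k, np_k in enumerate(nps):
        if np_k not in chain_of:
            continue                              # no instances for non-anaphoric NPs
        antecedents = [j for j in range(k) if chain_of.get(nps[j]) == chain_of[np_k]]
        if not antecedents:
            continue                              # np_k starts its chain
        j = antecedents[-1]                       # closest gold antecedent
        instances.append((featurize(nps[j], np_k), +1))
        for m in range(j + 1, k):                 # intervening NPs yield negative instances
            instances.append((featurize(nps[m], np_k), -1))
    return instances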
After training, the classifier is used to identify an antecedent for an NP in a test text. Specifically, each NP, NPk, is compared in turn to each preceding NP, NPj, from right to left, and NPj is selected as its antecedent if the pair is classified as coreferent. The process terminates as soon as an antecedent is found for NPk or the beginning of the text is reached.
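This closest-first resolution strategy can be sketched as follows; classifier.predict is assumed to return a truthy value exactly when the pair is classified as coreferent, and stands in for the trained SVM classifier.

def resolve_mp(nps, classifier, featurize):
    links = {}                                     # maps an NP index to its antecedent index
    for k in range(len(nps)):
        for j in range(k - 1, -1, -1):             # scan candidate antecedents right to left
            if classifier.predict(featurize(nps[j], nps[k])):
                links[k] = j                       # select the first coreferent candidate
                break                              # stop as soon as an antecedent is found
    return links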
Despite its popularity, the MP model has two major weaknesses. First, since each candidate antecedent for an NP to be resolved (henceforth an active NP) is considered independently of the others, this model only determines how good a candidate antecedent is relative to the active NP, but not how good a candidate antecedent is relative to other candidates. So, it fails to answer the critical question of which candidate antecedent is most probable. Second, it has limitations in its expressiveness: the information extracted from the two NPs alone may not be sufficient for making a coreference decision.
2.3.2 Cluster-Ranking Model
The cluster-ranking (CR) model addresses the two weaknesses of the MP model by combining the strengths of the entity-mention model (e.g., Luo et al. (2004), Yang et al. (2008)) and the mention-ranking model (e.g., Denis and Baldridge (2008)). Specifically, the CR model ranks the preceding clusters for an active NP so that the highest-ranked cluster is the one to which the active NP should be linked. Employing a ranker addresses the first weakness, as a ranker allows all candidates to be compared simultaneously. Considering preceding clusters rather than antecedents as candidates addresses the second weakness, as cluster-level features (i.e., features that are defined over any subset of NPs in a preceding cluster) can be employed. Details of the CR model can be found in Rahman and Ng (2009).

Since the CR model ranks preceding clusters, a training instance i(cj, NPk) represents a preceding cluster, cj, and an anaphoric NP, NPk. Each instance consists of features that are computed based solely on NPk as well as cluster-level features, which describe the relationship between cj and NPk. Motivated in part by Culotta et al. (2007), we create cluster-level features from the relational features in our feature set using four predicates: NONE, MOST-FALSE, MOST-TRUE, and ALL. Specifically, for each relational feature X, we first convert X into an equivalent set of binary-valued features if it is multi-valued. Then, for each resulting binary-valued feature Xb, we create four binary-valued cluster-level features: (1) NONE-Xb is true when Xb is false between NPk and each NP in cj; (2) MOST-FALSE-Xb is true when Xb is true between NPk and less than half (but at least one) of the NPs in cj; (3) MOST-TRUE-Xb is true when Xb is true between NPk and at least half (but not all) of the NPs in cj; and (4) ALL-Xb is true when Xb is true between NPk and each NP in cj.
We train a cluster ranker to jointly learn anaphoricity determination and coreference resolution using SVMlight's ranker-learning algorithm. Specifically, for each NP, NPk, we create a training instance between NPk and each preceding cluster cj using the features described above. Since we are learning a joint model, we need to provide the ranker with the option to start a new cluster by creating an additional training instance that contains the non-relational features describing NPk. The rank value of a training instance i(cj, NPk) created for NPk is the rank of cj among the competing clusters. If NPk is anaphoric, its rank is HIGH if NPk belongs to cj, and LOW otherwise. If NPk is non-anaphoric, its rank is LOW unless it is the additional training instance described above, which has rank HIGH.
After training, the cluster ranker processes the NPs in a test text in a left-to-right manner. For each active NP, NPk, we create test instances for it by pairing it with each of its preceding clusters. To allow for the possibility that NPk is non-anaphoric, we create an additional test instance as during training. All these test instances are then presented to the ranker. If the additional test instance is assigned the highest rank value, then we create a new cluster containing NPk. Otherwise, NPk is linked to the cluster that has the highest rank. Note that the partial clusters preceding NPk are formed incrementally based on the predictions of the ranker for the first k−1 NPs.
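The test-time procedure amounts to the following incremental loop, shown here as a sketch under assumed interfaces: ranker.score is taken to return a higher value for a higher-ranked instance, and instance_for / new_cluster_instance_for are placeholders for building the feature vectors described above.

def resolve_cr(nps, ranker, instance_for, new_cluster_instance_for):
    clusters = []
    for np_k in nps:
        scored = [(ranker.score(instance_for(c, np_k)), c) for c in clusters]
        new_score = ranker.score(new_cluster_instance_for(np_k))   # the non-anaphoric option
        if not scored or new_score > max(s for s, _ in scored):
            clusters.append([np_k])                # np_k starts a new cluster
        else:
            best_cluster = max(scored, key=lambda x: x[0])[1]
            best_cluster.append(np_k)              # link np_k to the highest-ranked cluster
    return clusters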
2.4 Evaluation Measures
We employ two commonly-used scoring programs, B3 (Bagga and Baldwin, 1998) and CEAF (Luo, 2005), both of which report results in terms of recall (R), precision (P), and F-measure (F) by comparing the gold-standard (i.e., key) partition, KP, against the system-generated (i.e., response) partition, RP.

Briefly, B3 computes the R and P values of each NP and averages these values at the end. Specifically, for each NP, NPj, B3 first computes the number of common NPs in KPj and RPj, the clusters containing NPj in KP and RP, respectively, and then divides this number by |KPj| and |RPj| to obtain the R and P values of NPj, respectively. On the other hand, CEAF finds the best one-to-one alignment between the key clusters and the response clusters.
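For reference, the per-NP computation underlying B3 can be written as the short sketch below; it assumes the key and response partitions cover the same NPs (the twinless-NP treatment discussed next is omitted).

def b_cubed(key_clusters, response_clusters):
    # key_clusters / response_clusters: lists of sets of NP identifiers
    key_of = {np: c for c in key_clusters for np in c}
    resp_of = {np: c for c in response_clusters for np in c}
    recalls, precisions = [], []
    for np in key_of:
        common = len(key_of[np] & resp_of[np])     # NPs shared by KPj and RPj
        recalls.append(common / len(key_of[np]))
        precisions.append(common / len(resp_of[np]))
    r = sum(recalls) / len(recalls)
    p = sum(precisions) / len(precisions)
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return r, p, f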
A complication arises when B3 is used to score a response partition containing automatically extracted NPs. Recall that B3 constructs a mapping between the NPs in the response and those in the key. Hence, if the response is generated using gold-standard NPs, then every NP in the response is mapped to some NP in the key and vice versa. In other words, there are no twinless (i.e., unmapped) NPs (Stoyanov et al., 2009). This is not the case when automatically extracted NPs are used, but the original description of B3 does not specify how twinless NPs should be scored (Bagga and Baldwin, 1998). To address this problem, we set the recall and precision of a twinless NP to zero, regardless of whether the NP appears in the key or the response. Note that CEAF can compare partitions with twinless NPs without any modification, since it operates by finding the best alignment between the clusters in the two partitions.

Additionally, in order not to over-penalize a response partition, we remove all the twinless NPs in the response that are singletons. The rationale is simple: since the resolver has successfully identified these NPs as singletons, it should not be penalized, and removing them avoids such a penalty.

Since B3 and CEAF align NPs/clusters, the lack of singleton clusters in the OntoNotes annotations implies that the resulting scores reflect solely how well a resolver identifies coreference links and do not take into account how well it identifies singleton clusters.
3 Extracting World Knowledge
In this section, we describe how we extract world knowledge for coreference resolution from three different sources: large-scale knowledge bases, coreference-annotated data, and unannotated data.
3.1 World Knowledge from Knowledge Bases
We extract world knowledge from two large-scale knowledge bases, YAGO and FrameNet.
3.1.1 Extracting Knowledge from YAGO
We choose to employ YAGO rather than the more popularly-used Wikipedia due to its potentially richer knowledge, which comprises 5 million facts extracted from Wikipedia and WordNet. Each fact is represented as a triple (NPj, rel, NPk), where rel is one of the 90 YAGO relation types defined on two NPs, NPj and NPk. Motivated in part by previous work (Bryl et al., 2010; Uryupina et al., 2011), we employ the two relation types that we believe are most useful for coreference resolution, TYPE and MEANS. TYPE is essentially an IS-A relation. For instance, the triple (AlbertEinstein, TYPE, physicist) denotes the fact that Albert Einstein is a physicist. MEANS provides different ways of expressing an entity, and therefore allows us to deal with synonymy and ambiguity. For instance, the two triples (Einstein, MEANS, AlbertEinstein) and (Einstein, MEANS, AlfredEinstein) denote the facts that Einstein may refer to the physicist Albert Einstein and the musicologist Alfred Einstein, respectively. Hence, the presence of one or both of these relations between two NPs provides strong evidence that the two NPs are coreferent.
YAGO's unification of the information in Wikipedia and WordNet enables it to extract facts that cannot be extracted from Wikipedia or WordNet alone, such as (MarthaStewart, TYPE, celebrity). To better appreciate YAGO's strengths, let us see how this fact was extracted. YAGO first heuristically maps each of the Wiki categories in the Wiki page for Martha Stewart to its semantically closest WordNet synset. For instance, the Wiki category AMERICAN TELEVISION PERSONALITIES is mapped to the synset corresponding to sense #2 of the word personality. Then, given that personality is a direct hyponym of celebrity in WordNet, YAGO extracts the desired fact.
We incorporate the world knowledge from YAGO into our coreference models as a binary-valued feature. If the MP model is used, the YAGO feature for an instance will have the value 1 if and only if the two NPs involved are in a TYPE or MEANS relation. On the other hand, if the CR model is used, the YAGO feature for an instance involving NPk and preceding cluster c will have the value 1 if and only if NPk has a TYPE or MEANS relation with any of the NPs in c. Since knowledge extraction from web-based encyclopedias is typically noisy (Ponzetto and Poesio, 2009), we use YAGO to determine whether two NPs have a relation only if one NP is a named entity (NE) of type person, organization, or location according to the Stanford NE recognizer (Finkel et al., 2005) and the other NP is a common noun.
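A simplified sketch of this feature is shown below; yago_pairs stands for a pre-extracted lookup of head pairs in a TYPE or MEANS relation, ne_type wraps the Stanford NE recognizer, and the .head attribute is an assumed NP property rather than part of the original system.

def yago_feature(np_k, candidate, yago_pairs, ne_type):
    def related(a, b):
        # consult YAGO only for a named entity paired with a common noun
        if (ne_type(a) is None) == (ne_type(b) is None):
            return False
        return (a.head, b.head) in yago_pairs or (b.head, a.head) in yago_pairs
    if isinstance(candidate, list):                # CR model: candidate is a preceding cluster
        return int(any(related(np_j, np_k) for np_j in candidate))
    return int(related(candidate, np_k))           # MP model: candidate is a single NP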
3.1.2 Extracting Knowledge from FrameNet
FrameNet is a lexico-semantic resource focused on semantic frames (Baker et al., 1998). As a schematic representation of a situation, a frame contains the lexical predicates that can invoke it as well as the frame elements (i.e., semantic roles). For example, the JUDGMENT COMMUNICATION frame describes situations in which a COMMUNICATOR communicates a judgment of an EVALUEE to an ADDRESSEE. This frame has COMMUNICATOR and EVALUEE as its core frame elements and ADDRESSEE as its non-core frame element, and can be invoked by more than 40 predicates, such as acclaim, accuse, commend, decry, denounce, praise, and slam.
To better understand why FrameNet contains potentially useful knowledge for coreference resolution, consider the following text segment:

Peter Anthony decries program trading as "limiting the game to a few," but he is not sure whether he wants to denounce it because ...

To establish the coreference relation between it and program trading, it may be helpful to know that decry and denounce appear in the same frame and the two NPs have the same semantic role.
This example suggests that features encoding both the semantic roles of the two NPs under consideration and whether the associated predicates are "related" to each other in FrameNet (i.e., whether they appear in the same frame) could be useful for identifying coreference relations. Two points regarding our implementation of these features deserve mention. First, since we do not employ verb sense disambiguation, we consider two predicates related as long as there is at least one semantic frame in which they both appear. Second, since FrameNet-style semantic role labelers are not publicly available, we use ASSERT (Pradhan et al., 2004), a semantic role labeler that provides PropBank-style semantic roles such as ARG0 (the PROTO-AGENT, which is typically the subject of a transitive verb) and ARG1 (the PROTO-PATIENT, which is typically its direct object).

Now, assuming that NPj and NPk are the arguments of two stemmed predicates, predj and predk, we create 15 features using the knowledge extracted from FrameNet and ASSERT as follows. First, we encode the knowledge extracted from FrameNet as one of three possible values: (1) predj and predk are in the same frame; (2) they are both predicates in FrameNet but never appear in the same frame; and (3) one or both predicates do not appear in FrameNet. Second, we encode the semantic roles of NPj and NPk as one of five possible values: ARG0-ARG0, ARG1-ARG1, ARG0-ARG1, ARG1-ARG0, and OTHERS (the default case).[2] Finally, we create 15 binary-valued features by pairing the 3 possible values extracted from FrameNet and the 5 possible values provided by ASSERT. Since these features are computed over two NPs, we can employ them directly for the MP model. Note that by construction, exactly one of these features will have a non-zero value. For the CR model, we extend their definitions so that they can be computed between an NP, NPk, and a preceding cluster, c. Specifically, the value of a feature is 1 if and only if its value between NPk and one of the NPs in c is 1 under its original definition.

[2] We focus primarily on ARG0 and ARG1 because they are the most important core arguments of a predicate and may provide more useful information than other semantic roles.
The above discussion assumes that the two NPs under consideration serve as predicate arguments. If this assumption fails, we will not create any features based on FrameNet for these two NPs.
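The 3 x 5 feature construction can be sketched as follows; share_frame and in_framenet are assumed FrameNet lookups, the roles are the PropBank labels returned by ASSERT, and the feature-name strings are illustrative rather than the ones used in the actual system.

def framenet_features(pred_j, role_j, pred_k, role_k, share_frame, in_framenet):
    if share_frame(pred_j, pred_k):
        fn_value = "SAME-FRAME"
    elif in_framenet(pred_j) and in_framenet(pred_k):
        fn_value = "NO-SHARED-FRAME"
    else:
        fn_value = "NOT-IN-FRAMENET"
    role_pair = (role_j, role_k)
    if role_pair not in {("ARG0", "ARG0"), ("ARG1", "ARG1"),
                         ("ARG0", "ARG1"), ("ARG1", "ARG0")}:
        role_pair = ("OTHERS",)
    # exactly one of the 3 x 5 = 15 binary features fires for a predicate-argument pair
    return {fn_value + "/" + "-".join(role_pair): 1}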
To our knowledge, FrameNet has not been exploited for coreference resolution. However, the use of related verbs is similar in spirit to Bean and Riloff's (2004) use of patterns for inducing contextual role knowledge, and the use of semantic roles is also discussed in Ponzetto and Strube (2006).
3.2 World Knowledge from Annotated Data
Since world knowledge is needed for coreference resolution, a human annotator must have employed world knowledge when coreference-annotating a document. We aim to design features that can "recover" such world knowledge from annotated data.
3.2.1 Features Based on Noun Pairs
A natural question is: what kind of world knowledge can we extract from annotated data? We may gather the knowledge that Barack Obama is a U.S. president if we see these two NPs appearing in the same coreference chain. Equally importantly, we may gather the commonsense knowledge needed for determining non-coreference. For instance, we may discover that a lion and a tiger are unlikely to refer to the same real-world entity after realizing that they never appear in the same chain in a large number of annotated documents. Note that any features computed based on WordNet distance or distributional similarity are likely to incorrectly suggest that lion and tiger are coreferent, since the two nouns are similar distributionally and according to WordNet.

Given these observations, one may collect the noun pairs from the (coreference-annotated) training data and use them as features to train a resolver. However, for these features to be effective, we need to address data sparseness, as many noun pairs in the training data may not appear in the test data.
To improve generalization, we instead create different kinds of noun-pair-based features given an annotated text. To begin with, we preprocess each document. A training text is preprocessed by randomly replacing 10% of its common nouns with the label UNSEEN. If an NP, NPk, is replaced with UNSEEN, all NPs that have the same string as NPk will also be replaced with UNSEEN. A test text is preprocessed differently: we simply replace all NPs whose strings are not seen in the training data with UNSEEN. Hence, artificially creating UNSEEN labels from a training text will allow a learner to learn how to handle unseen words in a test text.
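A minimal sketch of this preprocessing step, with hypothetical document and NP attributes (doc.nps, np.string, np.is_common_noun) standing in for whatever representation is actually used:

import random

def mark_unseen_training(docs, rate=0.10):
    # replace roughly 10% of the common-noun strings, and every NP sharing them, with UNSEEN
    for doc in docs:
        common_strings = sorted({np.string for np in doc.nps if np.is_common_noun})
        chosen = set(random.sample(common_strings, int(rate * len(common_strings))))
        for np in doc.nps:
            if np.string in chosen:
                np.string = "UNSEEN"

def mark_unseen_test(docs, training_vocab):
    # replace every NP whose string never occurs in the training data with UNSEEN
    for doc in docs:
        for np in doc.nps:
            if np.string not in training_vocab:
                np.string = "UNSEEN"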
Next, we create noun-pair-based features for the MP model, which will be used to augment the Baseline feature set. Here, each instance corresponds to two NPs, NPj and NPk, and is represented by three groups of binary-valued features.

Unseen features are applicable when both NPj and NPk are UNSEEN. Either an UNSEEN-SAME feature or an UNSEEN-DIFF feature is created, depending on whether the two NPs are the same string before being replaced with the UNSEEN token.

Lexical features are applicable when neither NPj nor NPk is UNSEEN. A lexical feature is an ordered pair consisting of the heads of the NPs. For a pronoun or a common noun, the head is the last word of the NP; for a proper name, the head is the entire NP.

Semi-lexical features aim to improve generalization, and are applicable when neither NPj nor NPk is UNSEEN. If exactly one of NPj and NPk is tagged as a NE by the Stanford NE recognizer, we create a semi-lexical feature that is identical to the lexical feature described above, except that the NE is replaced with its NE label. On the other hand, if both NPs are NEs, we check whether they are the same string. If so, we create a *NE*-SAME feature, where *NE* is replaced with the corresponding NE label. Otherwise, we check whether they have the same NE tag and a word-subset match (i.e., whether the word tokens in one NP appear in the other's list of word tokens). If so, we create a *NE*-SUBSAME feature, where *NE* is replaced with their NE label. Otherwise, we create a feature that is the concatenation of the NE labels of the two NPs.
The noun-pair-based features for the CR model can be generated using essentially the same method. Specifically, since each instance now corresponds to an NP, NPk, and a preceding cluster, c, we can generate a noun-pair-based feature by applying the above method to NPk and each of the NPs in c, and its value is the number of times it is applicable to NPk and c.
3.2.2 Features Based on Verb Pairs
As discussed above, features encoding the semantic roles of two NPs and the relatedness of the associated verbs could be useful for coreference resolution. Rather than encoding verb relatedness, we may replace verb relatedness with the verbs themselves in these features, and have the learner learn directly from coreference-annotated data whether two NPs serving as the objects of decry and denounce are likely to be coreferent or not, for instance.
Specifically, assuming that NPj and NPk are the arguments of two stemmed predicates, predj and predk, in the training data, we create five features as follows. First, we encode the semantic roles of NPj and NPk as one of five possible values: ARG0-ARG0, ARG1-ARG1, ARG0-ARG1, ARG1-ARG0, and OTHERS (the default case). Second, we create five binary-valued features by pairing each of these five values with the two stemmed predicates. Since these features are computed over two NPs, we can employ them directly for the MP model. Note that by construction, exactly one of these features will have a non-zero value. For the CR model, we extend their definitions so that they can be computed between an NP, NPk, and a preceding cluster, c. Specifically, the value of a feature is 1 if and only if its value between NPk and one of the NPs in c is 1 under its original definition.
The above discussion assumes that the two NPs under consideration serve as predicate arguments. If this assumption fails, we will not create any features based on verb pairs for these two NPs.
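The construction parallels the FrameNet sketch above, with the stemmed predicates themselves appearing in the feature name; the name format is illustrative only.

def verb_pair_feature(pred_j, role_j, pred_k, role_k):
    role_pair = (role_j, role_k)
    if role_pair not in {("ARG0", "ARG0"), ("ARG1", "ARG1"),
                         ("ARG0", "ARG1"), ("ARG1", "ARG0")}:
        role_pair = ("OTHERS",)
    # exactly one of the five binary features fires for this predicate pair
    return {pred_j + "/" + pred_k + "/" + "-".join(role_pair): 1}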
3.3 World Knowledge from Unannotated Data
Previous work has shown that syntactic appositions, which can be extracted using heuristics from unannotated documents or parse trees, are a useful source of world knowledge for coreference resolution (e.g., Daumé III and Marcu (2005), Ng (2007), Haghighi and Klein (2009)). Each extraction is an NP pair such as <Barack Obama, the president> and <Eastern Airlines, the carrier>, where the first NP in the pair is a proper name and the second NP is a common NP. Low-frequency extractions are typically assumed to be noisy and discarded.
We combine the extractions produced by Fleischman et al. (2003) and Ng (2007) to form a database consisting of 1.057 million NP pairs, and create a binary-valued feature for our coreference models using this database. If the MP model is used, this feature will have the value 1 if and only if the two NPs appear as a pair in the database. On the other hand, if the CR model is used, the feature for an instance involving NPk and preceding cluster c will have the value 1 if and only if NPk and at least one of the NPs in c appears as a pair in the database.
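A sketch of the lookup, with pair_db standing for the pre-extracted database of (proper name, common NP) string pairs and np.string an assumed NP attribute:

def appositive_feature(np_k, candidate, pair_db):
    def in_db(a, b):
        return (a.string, b.string) in pair_db or (b.string, a.string) in pair_db
    if isinstance(candidate, list):                # CR model: candidate is a preceding cluster
        return int(any(in_db(np_j, np_k) for np_j in candidate))
    return int(in_db(candidate, np_k))             # MP model: candidate is a single NP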
4 Evaluation

4.1 Experimental Setup

As described in Section 2, we use as our evaluation corpus the 410 documents that are coreference-annotated using the ACE and OntoNotes annotation schemes. Specifically, we divide these documents into five (disjoint) folds of roughly the same size, training the MP model and the CR model using SVMlight on four folds and evaluating their performance on the remaining fold. The linguistic features, as well as the NPs used to create the training and test instances, are computed automatically. We employ B3 and CEAF as described in Section 2.4 to score the output of a coreference system.
4.2 Results and Discussion

4.2.1 Baseline Models

Since our goal is to evaluate the effectiveness of the features encoding world knowledge for learning-based coreference resolution, we employ as our baselines the MP model and the CR model trained on the Baseline feature set, which does not contain any features encoding world knowledge. For the MP model, the Baseline feature set consists of the 39 features described in Section 2.3.1; for the CR model, the Baseline feature set consists of the cluster-level features derived from the 39 features used in the Baseline MP model (see Section 2.3.2). Results of the MP model and the CR model employing the Baseline feature set are shown in rows 1 and 8 of Table 1, respectively.
                                             ACE                             OntoNotes
                                      B3            CEAF              B3            CEAF
                                   R    P    F    R    P    F      R    P    F    R    P    F
Results for the Mention-Pair Model
 1  Base                         56.5 69.7 62.4  54.9 66.3 60.0   50.4 56.7 53.3  48.9 54.5 51.5
 2  Base+YAGO Types (YT)         57.3 70.3 63.1  58.7 67.5 62.8   51.7 57.9 54.6  50.3 55.6 52.8
 3  Base+YAGO Means (YM)         56.7 70.0 62.7  55.3 66.5 60.4   50.6 57.0 53.6  49.3 54.9 51.9
 4  Base+Noun Pairs (WP)         57.5 70.6 63.4  55.8 67.4 61.1   51.6 57.6 54.4  49.7 55.4 52.4
 5  Base+FrameNet (FN)           56.4 70.9 62.8  54.9 67.5 60.5   50.5 57.5 53.8  48.8 55.1 51.8
 6  Base+Verb Pairs (VP)         56.9 71.3 63.3  55.2 67.6 60.8   50.7 57.9 54.0  49.0 55.4 52.0
 7  Base+Appositives (AP)        56.9 70.0 62.7  55.6 66.9 60.7   50.3 57.1 53.5  49.1 55.1 51.9
Results for the Cluster-Ranking Model
 8  Base                         61.7 71.2 66.1  59.6 68.8 63.8   53.4 59.2 56.2  51.1 57.3 54.0
 9  Base+YAGO Types (YT)         63.5 72.4 67.6  61.7 70.0 65.5   54.8 60.6 57.6  52.4 58.9 55.4
10  Base+YAGO Means (YM)         62.0 71.4 66.4  59.9 69.1 64.1   53.9 59.5 56.6  51.4 57.5 54.3
11  Base+Noun Pairs (WP)         64.1 73.4 68.4  61.3 70.1 65.4   55.9 62.1 58.8  53.5 59.1 56.2
12  Base+FrameNet (FN)           61.8 71.9 66.5  59.8 69.3 64.2   53.5 60.0 56.6  51.1 57.9 54.3
13  Base+Verb Pairs (VP)         62.1 72.2 66.8  60.1 69.3 64.4   54.4 60.1 57.1  51.9 58.2 54.9
14  Base+Appositives (AP)        63.1 71.7 67.1  60.5 69.4 64.6   54.1 60.1 56.9  51.9 57.8 54.7

Table 1: Results obtained by applying different types of features in isolation to the Baseline system.
                                             ACE                             OntoNotes
                                      B3            CEAF              B3            CEAF
                                   R    P    F    R    P    F      R    P    F    R    P    F
Results for the Mention-Pair Model
 1  Base                         56.5 69.7 62.4  54.9 66.3 60.0   50.4 56.7 53.3  48.9 54.5 51.5
 2  Base+YT                      57.3 70.3 63.1  58.7 67.5 62.8   51.7 57.9 54.6  50.3 55.6 52.8
 3  Base+YT+YM                   57.8 70.9 63.6  59.1 67.9 63.2   52.1 58.3 55.0  50.8 56.0 53.3
 4  Base+YT+YM+WP                59.5 71.9 65.1  57.5 69.4 62.9   53.1 59.2 56.0  51.5 57.1 54.1
 5  Base+YT+YM+WP+FN             59.6 72.1 65.3  57.2 69.7 62.8   53.1 59.5 56.2  51.3 57.4 54.2
 6  Base+YT+YM+WP+FN+VP          59.9 72.5 65.6  57.8 70.0 63.3   53.4 59.8 56.4  51.8 57.7 54.6
 7  Base+YT+YM+WP+FN+VP+AP       59.7 72.4 65.4  57.6 69.8 63.1   53.2 59.8 56.3  51.5 57.6 54.4
Results for the Cluster-Ranking Model
 8  Base                         61.7 71.2 66.1  59.6 68.8 63.8   53.4 59.2 56.2  51.1 57.3 54.0
 9  Base+YT                      63.5 72.4 67.6  61.7 70.0 65.5   54.8 60.6 57.6  52.4 58.9 55.4
10  Base+YT+YM                   63.9 72.6 68.0  62.1 70.4 66.0   55.2 61.0 57.9  52.8 59.1 55.8
11  Base+YT+YM+WP                66.1 75.4 70.4  62.9 72.4 67.3   57.7 64.4 60.8  55.1 61.6 58.2
12  Base+YT+YM+WP+FN             66.3 75.1 70.4  63.1 72.3 67.4   57.3 64.1 60.5  54.7 61.2 57.8
13  Base+YT+YM+WP+FN+VP          66.6 75.9 70.9  63.5 72.9 67.9   57.7 64.4 60.8  55.1 61.6 58.2
14  Base+YT+YM+WP+FN+VP+AP       66.4 75.7 70.7  63.3 72.9 67.8   57.6 64.3 60.8  55.0 61.5 58.1

Table 2: Results obtained by adding different types of features incrementally to the Baseline system.
Each row contains the B3 and CEAF results of the corresponding coreference model when it is evaluated using the ACE and OntoNotes annotations as the gold standard. As we can see, the MP model achieves F-measure scores of 62.4 (B3) and 60.0 (CEAF) on ACE and 53.3 (B3) and 51.5 (CEAF) on OntoNotes, and the CR model achieves F-measure scores of 66.1 (B3) and 63.8 (CEAF) on ACE and 56.2 (B3) and 54.0 (CEAF) on OntoNotes. Also, the results show that the CR model is stronger than the MP model, corroborating previous empirical findings (Rahman and Ng, 2009).
4.2.2 Incorporating World Knowledge
Next, we examine the usefulness of world knowledge for coreference resolution. The remaining rows in Table 1 show the results obtained when different types of features encoding world knowledge are applied to the Baseline system in isolation. The best result for each combination of data set, evaluation measure, and coreference model is boldfaced.

Two points deserve mention. First, each type of features improves the Baseline, regardless of the coreference model, the evaluation measure, and the annotation scheme used. This suggests that all these feature types are indeed useful for coreference resolution.
1. The Bush White House is breeding non-duck ducks the same way the Nixon White House did: It hops on an issue that is unopposable – cleaner air, better treatment of the disabled, better child care. The President came up with a good bill, but now may end up signing the awful bureaucratic creature hatched on Capitol Hill.

2. The tumor, he suggested, developed when the second, normal copy also was damaged. He believed colon cancer might also arise from multiple "hits" on cancer suppressor genes, as it often seems to develop in stages.

Table 3: Example errors introduced by YAGO and FrameNet.
It is worth noting that in all but a few cases involving the FrameNet-based and appositive-based features, the rise in F-measure is accompanied by a simultaneous rise in recall and precision. This is perhaps not surprising: as the use of world knowledge helps discover coreference links, recall increases; and as more (relevant) knowledge is available to make coreference decisions, precision increases.

Second, the feature types that yield the best improvement over the Baseline are YAGO TYPE and Noun Pairs. When the MP model is used, the best coreference system improves the Baseline by 1–1.3% (B3) and 1.3–2.8% (CEAF) in F-measure. On the other hand, when the CR model is used, the best system improves the Baseline by 2.3–2.6% (B3) and 1.7–2.2% (CEAF) in F-measure.
Table 2 shows the results obtained when the different types of features are added to the Baseline one after the other. Specifically, we add the feature types in this order: YAGO TYPE, YAGO MEANS, Noun Pairs, FrameNet, Verb Pairs, and Appositives. In comparison to the results in Table 1, we can see that better results are obtained when the different types of features are applied to the Baseline in combination than in isolation, regardless of the coreference model, the evaluation measure, and the annotation scheme used. The best-performing system, which employs all but the Appositive features, outperforms the Baseline by 3.1–3.3% in F-measure when the MP model is used and by 4.1–4.8% in F-measure when the CR model is used. In both cases, the gains in F-measure are accompanied by a simultaneous rise in recall and precision. Overall, these results seem to suggest that the CR model is making more effective use of the available knowledge than the MP model, and that the different feature types are providing complementary information for the two coreference models.
4.3 Example Errors
While the different types of features we considered improve the performance of the Baseline primarily via the establishment of coreference links, some of these links are spurious. Sentences 1 and 2 of Table 3 show the spurious coreference links introduced by the CR model when YAGO and FrameNet are used, respectively. In sentence 1, while The President and Bush are coreferent, YAGO caused the CR model to establish the spurious link between The President and Nixon owing to the proximity of the two NPs and the presence of this NP pair in the YAGO TYPE relation. In sentence 2, FrameNet caused the CR model to establish the spurious link between The tumor and colon cancer because these two NPs are the ARG0 arguments of develop and arise, which appear in the same semantic frame in FrameNet.
5 Conclusions
We have examined the utility of three major sources of world knowledge for coreference resolution, namely, large-scale knowledge bases (YAGO, FrameNet), coreference-annotated data (Noun Pairs, Verb Pairs), and unannotated data (Appositives), by applying them to two learning-based coreference models, the mention-pair model and the cluster-ranking model, and evaluating them on documents annotated with the ACE and OntoNotes annotation schemes. When applying the different types of features in isolation to a Baseline system that does not employ world knowledge, we found that all of them improved the Baseline regardless of the underlying coreference model, the evaluation measure, and the annotation scheme, with YAGO TYPE and Noun Pairs yielding the largest performance gains. Nevertheless, the best results were obtained when they were applied in combination to the Baseline system. We conclude from these results that the different feature types we considered are providing complementary world knowledge to the coreference resolvers, and while each of them provides fairly small gains, their cumulative benefits can be substantial.
Acknowledgments

We thank the three reviewers for their invaluable comments on an earlier draft of the paper. This work was supported in part by NSF Grant IIS-0812261.
References
Amit Bagga and Breck Baldwin. 1998. Algorithms for scoring coreference chains. In Proceedings of the Linguistic Coreference Workshop at The First International Conference on Language Resources and Evaluation, pages 563–566.

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics, Volume 1, pages 86–90.

David Bean and Ellen Riloff. 2004. Unsupervised learning of contextual role knowledge for coreference resolution. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 297–304.

Eric Bengtson and Dan Roth. 2008. Understanding the value of features for coreference resolution. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 294–303.

Volha Bryl, Claudio Giuliano, Luciano Serafini, and Kateryna Tymoshenko. 2010. Using background knowledge to support coreference resolution. In Proceedings of the 19th European Conference on Artificial Intelligence, pages 759–764.

Eugene Charniak. 1972. Towards a Model of Children's Story Comprehension. AI-TR 266, Artificial Intelligence Laboratory, Massachusetts Institute of Technology.

Aron Culotta, Michael Wick, and Andrew McCallum. 2007. First-order probabilistic models for coreference resolution. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 81–88.

Hal Daumé III and Daniel Marcu. 2005. A large-scale exploration of effective global features for a joint entity detection and tracking model. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 97–104.

Pascal Denis and Jason Baldridge. 2008. Specialized models and ranking for coreference resolution. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 660–669.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 363–370.

Michael Fleischman, Eduard Hovy, and Abdessamad Echihabi. 2003. Offline strategies for online question answering: Answering questions before they are asked. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 1–7.

Aria Haghighi and Dan Klein. 2009. Simple coreference resolution with rich syntactic and semantic features. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1152–1161.

Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: The 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pages 57–60.

Thorsten Joachims. 2002. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133–142.

Xiaoqiang Luo, Abe Ittycheriah, Hongyan Jing, Nanda Kambhatla, and Salim Roukos. 2004. A mention-synchronous coreference resolution algorithm based on the Bell tree. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 135–142.

Xiaoqiang Luo. 2005. On coreference resolution performance metrics. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 25–32.

Ruslan Mitkov. 2002. Anaphora Resolution. Longman.

Vincent Ng and Claire Cardie. 2002. Improving machine learning approaches to coreference resolution. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 104–111.

Vincent Ng. 2007. Shallow semantics for coreference resolution. In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence, pages 1689–1694.

Simone Paolo Ponzetto and Massimo Poesio. 2009. State-of-the-art NLP approaches to coreference resolution: Theory and practical recipes. In Tutorial Abstracts of ACL-IJCNLP 2009, page 6.

Simone Paolo Ponzetto and Michael Strube. 2006. Exploiting semantic role labeling, WordNet and Wikipedia for coreference resolution. In Proceedings ...