Automatic Text Summarization Based on the Global Document Annotation
Katashi Nagao
Sony Computer Science Laboratory Inc.
3-14-13 Higashi-gotanda, Shinagawa-ku, Tokyo 141-0022, Japan
nagao@csl.sony.co.jp
Kôiti Hasida
Electrotechnical Laboratory
1-1-4 Umezono, Tukuba, Ibaraki 305-8568, Japan
hasida@etl.go.jp
Abstract
The GDA (Global Document Annotation) project proposes a tag set which allows machines to automatically infer the underlying semantic/pragmatic structure of documents. Its objectives are to promote development and spread of NLP/AI applications to render GDA-tagged documents versatile and intelligent contents, which should motivate WWW (World Wide Web) users to tag their documents as part of content authoring. This paper discusses automatic text summarization based on GDA. Its main features are a domain/style-free algorithm and personalization on summarization which reflects readers' interests and preferences. In order to calculate the importance score of a text element, the algorithm uses spreading activation on an intra-document network which connects text elements via thematic, rhetorical, and coreferential relations. The proposed method is flexible enough to dynamically generate summaries of various sizes. A summary browser supporting personalization is reported as well.
1 Introduction
The WWW has opened up an era in which an unrestricted number of people publish their messages electronically through their online documents. However, it is still very hard to automatically process contents of those documents. The reasons include the following:
1. HTML (HyperText Markup Language) tags mainly specify the physical layout of documents. They address very few content-related annotations.
2. Hypertext links cannot very much help readers recognize the content of a document.
3. The WWW authors tend to be less careful about wording and readability than in traditional printed media. Currently there is no systematic means for quality control in the WWW.
Although HTML is a flexible tool that allows you to freely write and read messages on the WWW, it is neither very convenient to readers nor suitable for automatic processing of contents.
We have been developing an integrated platform for document authoring, publishing, and reuse by combining natural language and WWW technologies. As the first step of our project, we defined a new tag set and developed tools for editing tagged texts and browsing these texts. The browser has the functionality of summarization and content-based retrieval of tagged documents.
This paper focuses on summarization based on this system. The main features of our summarization method are a domain/style-free algorithm and personalization to reflect readers' interests and preferences. This method naturally outperforms the traditional summarization methods, which just pick out sentences highly scored on the basis of superficial clues such as word count.
2 Global Document Annotation
GDA (Global Document Annotation) is a challenging project to make WWW texts machine-understandable on the basis of a new tag set, and to develop content-based presentation, retrieval, question-answering, summarization, and translation systems with much higher quality than before. GDA thus proposes an integrated global platform for electronic content authoring, presentation, and reuse. The GDA tag set is based on XML (Extensible Markup Language), and designed to be as compatible as possible with HTML, TEI, EAGLES, and so forth.
An example of a GDA-tagged sentence is as follows:
<su><np sem=time0>time</np>
<vp><v sem=fly1>flies</v>
<adp><ad sem=like0>like</ad> <np>an
<n sem=arrow0>arrow</n></np>
</adp></vp> </su>
<su> means sentential unit. <n>, <np>, <v>, <vp>, <ad> and <adp> mean noun, noun phrase, verb, verb phrase, adnoun or adverb (including preposition and postposition), and adnominal or adverbial phrase, respectively (see footnote 1).
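As an illustration of how such markup can be processed, the following minimal sketch reads the example sentence with Python's standard XML parser and lists the word senses and the tag hierarchy. It is not part of the GDA tools: the attribute values are quoted here so that a generic XML parser accepts them, and the names GDA_SENTENCE, word_senses and outline are hypothetical.

```python
import xml.etree.ElementTree as ET

# The example sentence from the text. Attribute values are quoted here so
# that a standard XML parser accepts them (an assumption of this sketch).
GDA_SENTENCE = (
    '<su><np sem="time0">time</np> '
    '<vp><v sem="fly1">flies</v> '
    '<adp><ad sem="like0">like</ad> <np>an '
    '<n sem="arrow0">arrow</n></np>'
    '</adp></vp></su>'
)

def word_senses(xml_text):
    """Return (surface form, sense id) pairs for elements carrying a sem attribute."""
    root = ET.fromstring(xml_text)
    return [(el.text, el.get("sem")) for el in root.iter() if el.get("sem")]

def outline(element, depth=0):
    """Return the tag hierarchy as a list of indented tag names."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

print(word_senses(GDA_SENTENCE))
print("\n".join(outline(ET.fromstring(GDA_SENTENCE))))
```

The outline makes the nesting explicit: the sentential unit contains a noun phrase and a verb phrase, and the verb phrase contains the verb and the adverbial phrase.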
The GDA initiative aims at having many WWW authors annotate their on-line documents with this common tag set so that machines can automatically recognize the underlying semantic and pragmatic structures of those documents much more easily than by analyzing traditional HTML files. A huge amount of annotated data is expected to emerge, which should serve not just as tagged linguistic corpora but also as a worldwide, self-extending knowledge base, mainly consisting of examples showing how our knowledge is manifested.
GDA has three main steps:
1. Propose an XML tag set which allows machines to automatically infer the underlying structure of documents.
2. Promote development and spread of NLP/AI applications to turn tagged texts into versatile and intelligent contents.
3. Motivate thereby the authors of WWW files to annotate their documents using those tags.
2.1 Thematic/Rhetorical Relations
The rel attribute encodes a relationship in which the current element stands with respect to the element that it semantically depends on. Its value is called a relational term. A relational term denotes a binary relation, which may be a thematic role such as agent, patient, recipient, etc., or a rhetorical relation such as cause, concession, etc. Thus we conflate thematic roles and rhetorical relations here, because the distinction between them is often vague. For instance, concession may be both an intrasentential and an intersentential relation.
Here is an example of a rel attribute:
<su ctyp=fd><name rel=agt>Tom</name>
<vp>came</vp> </su>
ctyp=fd means that the first element <name rel=agt>Tom</name> depends on the second element <vp>came</vp>. rel=agt means that Tom has the agent role with respect to the event denoted by came.
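The dependency reading described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: attribute values are quoted for the benefit of a generic XML parser, only the ctyp value fd with exactly two daughters is handled, and the function name dependencies is hypothetical.

```python
import xml.etree.ElementTree as ET

def dependencies(xml_text):
    """Return (dependent, relational term, head) triples for ctyp="fd" elements.

    Only the case described in the text is covered: the first daughter
    depends on the second one.
    """
    root = ET.fromstring(xml_text)
    triples = []
    for parent in root.iter():
        children = list(parent)
        if parent.get("ctyp") == "fd" and len(children) == 2:
            dependent, head = children
            triples.append((
                "".join(dependent.itertext()).strip(),
                dependent.get("rel"),
                "".join(head.itertext()).strip(),
            ))
    return triples

EXAMPLE = '<su ctyp="fd"><name rel="agt">Tom</name> <vp>came</vp></su>'
print(dependencies(EXAMPLE))
```

For the example sentence this yields a single triple stating that Tom is the agent of came.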
rel is an open-class attribute, potentially encompassing all the binary relations lexicalized in natural languages. An exhaustive listing of thematic roles and rhetorical relations appears impossible, as widely recognized. We are not yet sure about how many thematic roles and rhetorical relations are sufficient for engineering applications. However, the appropriate granularity of classification will be determined by the current level of technology.
1 A more detailed description of the GDA tag set can be found at http://www.etl.go.jp/etl/nl/GDA/tagset.html
2.2 Anaphora and Coreference
Each element may have an identifier as the value of the id attribute. An anaphoric expression should have the ana attribute with its antecedent's id value. An example follows:
<name id=1>John</name> beats
<adp ana=1>his</adp> dog
A non-anaphoric coreference is marked by the crf attribute, whose usage is the same as the ana attribute.
When the coreference is at the level of type (kind, sort, etc.) which the referents of the antecedent and the anaphor are tokens of, we use the cotyp attribute as below:
You bought <np id=11>a car</np>
I bought <np cotyp=11>one</np>, too
A zero anaphora is encoded by using the appropriate relational term as an attribute name with the referent's id value. Zero anaphors of compulsory elements, which describe the internal structure of the events represented by the verbs or adjectives, are required to be resolved. Zero anaphors of optional elements, such as those with reason and means roles, need not be. Here is an example of a zero anaphora concerning an optional thematic role ben (for beneficiary):
Tom visited <name id=111>Mary</name>
He <v ben=111>brought</v> a present
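The id-based linking described in this section can be illustrated with a short sketch that collects ana, crf and cotyp links from tagged text. The assumptions are labeled: attribute values are quoted, the <doc> wrapper element and the names LINK_ATTRIBUTES, text_of and coreference_links are hypothetical, and zero anaphora (which uses relational terms as attribute names) is left out.

```python
import xml.etree.ElementTree as ET

# Attributes whose value is the id of another element.
LINK_ATTRIBUTES = ("ana", "crf", "cotyp")

def text_of(element):
    """Return the full text covered by an element."""
    return "".join(element.itertext()).strip()

def coreference_links(xml_text):
    """Return (link type, referring text, antecedent text) triples."""
    root = ET.fromstring(xml_text)
    by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}
    links = []
    for el in root.iter():
        for attr in LINK_ATTRIBUTES:
            target = el.get(attr)
            if target in by_id:
                links.append((attr, text_of(el), text_of(by_id[target])))
    return links

# <doc> is a hypothetical wrapper so that the parser sees a single root.
DOC = (
    '<doc>'
    '<su><name id="1">John</name> beats <adp ana="1">his</adp> dog</su>'
    '<su>You bought <np id="11">a car</np></su>'
    '<su>I bought <np cotyp="11">one</np>, too</su>'
    '</doc>'
)
print(coreference_links(DOC))
```

Links of this kind are what later connect otherwise separate sentence trees into one network.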
3 Text Summarization
As an example of a basic application of GDA, we have developed an automatic text summarization system. Summarization generally requires deep semantic processing and a lot of background knowledge. However, most previous works use several superficial clues and heuristics on specific styles or configurations of documents to summarize.
For example, clues for determining the importance of a sentence include (1) sentence length, (2) keyword count, (3) tense, (4) sentence type (such as fact, conjecture and assertion), (5) rhetorical relation (such as reason and example), and (6) position of sentence in the whole text. Most of these are extracted by a shallow processing of the text. Such a computation is rather robust.
Present summarization systems (Watanabe, 1996; Hovy and Lin, 1997) use such clues to calculate an importance score for each sentence, choose sentences according to the score, and simply put the selected sentences together in order of their occurrences in the original document. In a sense, these systems are successful enough to be practical, and are based on reliable technologies. However, the quality of summarization cannot be improved beyond this basic level without any deep content-based processing.
We propose a new summarization method based on GDA. This method employs a spreading activation technique (Hasida et al., 1987) to calculate the importance values of elements in the text. Since the method does not employ any heuristics dependent on the domain and style of documents, it is applicable to any GDA-tagged documents. The method can also trim sentences in the summary because importance scores are assigned to elements smaller than sentences.
A GDA-tagged document naturally defines an intra-document network in which nodes correspond to elements and links represent the semantic relations mentioned in the previous section. This network consists of sentence trees (syntactic head-daughter hierarchies of subsentential elements such as words or phrases), coreference/anaphora links, document/subdivision/paragraph nodes, and rhetorical relation links.
Figure 1 shows a graphical representation of the intra-document network.
[Figure 1: Intra-Document Network. The diagram shows a hierarchy of document, subdivision, paragraph (optional), sentence, and subsentential segment nodes, connected by links.]
The summarization algorithm is as follows:
1. Spreading activation is performed in such a way that two elements have the same activation value if they are coreferent or one of them is the syntactic head of the other.
2. The unmarked element with the highest activation value is marked for inclusion in the summary.
3. When an element is marked, the other elements listed below are recursively marked as well, until no more elements can be marked:
  • its head
  • its antecedent
  • its compulsory or a priori important daughters, the values of whose relational attributes are agt, pat, obj, pos, cnt, cau, cnd, sbm, and so forth
  • the antecedent of a zero anaphor in it with some of the above values for the relational attribute
4. All marked elements in the intra-document network are generated, preserving the order of their positions in the original document.
5. If the size of the summary reaches the user-specified value, then terminate; otherwise go back to Step 2.
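The marking loop of Steps 2 to 5 can be sketched as follows. This is a simplified illustration, not the authors' implementation. Several things are assumptions: the activation pass is a plain iterative scheme over head and coreference links and does not reproduce the equal-activation constraint of Step 1 or the technique of Hasida et al. (1987); "its head" is read as the element that the marked element depends on; summary size is counted in elements; and the Element class, the function names, and the relational terms tim, loc and man in the toy data are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Relational terms treated as compulsory daughters (a subset of those in Step 3).
COMPULSORY = {"agt", "pat", "obj", "pos", "cnt", "cau"}

@dataclass
class Element:
    text: str
    head: Optional[int] = None        # index of the element this one depends on
    rel: Optional[str] = None         # relational term with respect to the head
    antecedent: Optional[int] = None  # index of a coreferent element, if any

def spread_activation(elements, bias=None, rounds=20, decay=0.5):
    """Toy spreading activation over head and coreference links (assumed scheme)."""
    n = len(elements)
    neighbours = [set() for _ in range(n)]
    for i, el in enumerate(elements):
        for j in (el.head, el.antecedent):
            if j is not None:
                neighbours[i].add(j)
                neighbours[j].add(i)
    # Every element gets a unit input; bias adds user-specific input.
    external = [1.0 + (bias or {}).get(i, 0.0) for i in range(n)]
    activation = external[:]
    for _ in range(rounds):
        raw = [external[i] + decay * sum(activation[j] for j in neighbours[i])
               for i in range(n)]
        top = max(raw)
        activation = [value / top for value in raw]  # keep values bounded
    return activation

def mark_closure(elements, start, marked):
    """Step 3: recursively mark the head, the antecedent and compulsory daughters."""
    stack = [start]
    while stack:
        i = stack.pop()
        if i in marked:
            continue
        marked.add(i)
        el = elements[i]
        stack.extend(j for j in (el.head, el.antecedent) if j is not None)
        stack.extend(j for j, d in enumerate(elements)
                     if d.head == i and d.rel in COMPULSORY)

def summarize(elements, max_elements, bias=None):
    """Mark elements until the summary holds at least max_elements elements."""
    if not elements:
        return ""
    activation = spread_activation(elements, bias)
    marked = set()
    while len(marked) < len(elements):
        best = max((i for i in range(len(elements)) if i not in marked),
                   key=lambda i: activation[i])      # Step 2
        mark_closure(elements, best, marked)         # Step 3
        if len(marked) >= max_elements:              # Step 5
            break
    # Step 4: generate marked elements in their original order.
    return " ".join(elements[i].text for i in sorted(marked))

# Toy data: "Tom visited Mary yesterday in Kyoto. He brought a present happily."
ELEMENTS = [
    Element("Tom", head=1, rel="agt"),
    Element("visited"),
    Element("Mary", head=1, rel="pat"),
    Element("yesterday", head=1, rel="tim"),
    Element("in Kyoto", head=1, rel="loc"),
    Element("He", head=6, rel="agt", antecedent=0),
    Element("brought"),
    Element("a present", head=6, rel="obj"),
    Element("happily", head=6, rel="man"),
]
print(summarize(ELEMENTS, 1))
print(summarize(ELEMENTS, 1, bias={7: 5.0}))
```

In the toy data the first call keeps only the best-connected verb with its agent and patient, so optional modifiers are trimmed. The second call adds extra input to "a present"; marking it pulls in its head, the agent "He", and through the coreference link the antecedent sentence as well.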
The following article of the Wall Street Journal was used for testing this algorithm.
During its centennial year, The Wall Street Journal will report events of the past century that stand as milestones of American business history. THREE COMPUTERS THAT CHANGED the face of personal computing were launched in 1977. That year the Apple II, Commodore Pet and Tandy TRS came to market. The computers were crude by today's standards. Apple II owners, for example, had to use their television sets as screens and stored data on audiocassettes. But Apple II was a major advance from Apple I, which was built in a garage by Stephen Wozniak and Steven Jobs for hobbyists such as the Homebrew Computer Club. In addition, the Apple II was an affordable $1,298. Crude as they were, these early PCs triggered explosive product development in desktop models for the home and office. Big mainframe computers for business had been around for years. But the new 1977 PCs - unlike earlier built-from-kit types such as the Altair, Sol and IMSAI - had keyboards and could store about two pages of data in their memories. Current PCs are more than 50 times faster and have memory capacity 500 times greater than their 1977 counterparts. There were many pioneer PC contributors. William Gates and Paul Allen in 1975 developed an early language-housekeeper system for PCs, and Gates became an industry billionaire six years after IBM adapted one of these versions in 1981. Alan F. Shugart, currently chairman of Seagate Technology, led the team that developed the disk drives for PCs. Dennis Hayes and Dale Heatherington, two Atlanta engineers, were co-developers of the internal modems that allow PCs to share data via the telephone. IBM, the world leader in computers, didn't offer its first PC until August 1981 as many other companies entered the market. Today PC shipments annually total some $38.3 billion world-wide.
Here is a short, computer-generated summary of this sample article:
THREE COMPUTERS THAT CHANGED the face of personal computing were launched. Crude as they were, these early PCs triggered explosive product development. Current PCs are more than 50 times faster and have memory capacity 500 times greater than their counterparts.
The proposed method is flexible enough to dynamically generate summaries of various sizes. If a longer summary is needed, the user can change the window size of the summary browser, as described in Section 3.1. Then the summary changes its size to fit into the new window. An example of a longer summary follows:
THREE COMPUTERS THAT CHANGED the face of personal computing were launched. The Apple II, Commodore Pet and Tandy TRS came to market. The computers were crude. Apple II owners had to use their television sets and stored data on audiocassettes. The Apple II was an affordable $1,298. Crude as they were, these early PCs triggered explosive product development. The new PCs had keyboards and could store about two pages of data in their memories. Current PCs are more than 50 times faster and have memory capacity 500 times greater than their counterparts. There were many pioneer PC contributors. William Gates and Paul Allen developed an early language-housekeeper system, and Gates became an industry billionaire after IBM adapted one of these versions. IBM didn't offer its first PC.
An observation obtained from this experiment is that tags for coreferences and thematic and rhetorical relations are almost enough to make a summary. In particular, coreferences and rhetorical relations help summarization very much.
GDA tags allow us to apply more sophisticated natural language processing technologies to come up with better summaries. It is straightforward to incorporate sentence generation technologies to paraphrase parts of the document, rather than just selecting or pruning them. Annotations on anaphora can be exploited to produce context-dependent paraphrases. Also, the summary could be itemized to fit in a slide presentation.
3.1 Summary Browser
We developed a summary browser using a Java-capable WWW browser. Figure 2 shows an example screen of the summary browser.
[Figure 2: Summary Browser. The screenshot shows the original Wall Street Journal article in one frame and its generated summary in the frame below it.]
It has the following functionalities:
1. A screen is divided into three parts (frames). One frame provides a user input form through which you can select documents and type keywords. The other frames are for displaying the original document and its summary.
2. The frame for the summary text is resizable by sliding the boundary with the original document frame. The size of the summary frame influences the size of the summary itself. Thus you can see the summary in a preferred size and change the size in an easy and intuitive way.
3. The frame for the original document is mouse sensitive. You can select any element of text in this frame. This function is used for the customization of the summary, as described later.
4. HTML tags are also handled by the browser, so images are displayed and hyperlinks are handled in the summary as well. If a hyperlink is clicked in the original document frame, the linked document appears in the same frame. The hyperlinks are kept in the summary.
4 Personalization
A good summary might depend on the background knowledge of its creator. It should also change according to the interests or preferences of its reader. Let us refer to the adaptation of the summarization process to a particular user as personalization. GDA-based summarization can be easily personalized because our method is flexible enough to bias a summary toward the user's interests. The user can also select any elements in the original document during summarization, to interactively provide information concerning his or her personal interests.
We have been developing the following techniques
for personalized summarization:
• The user can input any words of interest. The system relates those words with those in the document using cooccurrence statistics acquired from a corpus and a dictionary such as WordNet (Miller, 1995). The related words in the document are assigned numeric values that reflect closeness to the input words. These values are used in spreading activation for calculating importance scores.
• Interactive customization by selecting any elements from a document
The user can mark any words, phrases, and sentences to be included in the summary. The summary browser allows the user to select those elements with pointing devices such as a mouse or stylus pen. The user can easily select elements by clicking on them. The click count corresponds to the level of elements. That is, the first click means the word, the second the next larger element containing it, and so on. The selected elements will have higher activation values in spreading activation.
• Learning user interests by observation of WWW browsing
The summarization system can customize the summary according to the user without any explicit user inputs. We implemented a learning mechanism for user personalization. The mechanism uses a weighted feature vector. The feature corresponds to the category or topic of documents. The category is defined according to a WWW directory such as Yahoo. The topic is detected using the summarization technique.
Learning is roughly divided into data acquisition and model modification. The user's behavioral data is acquired by detecting the user's information accesses on the WWW. This data includes the time and duration of each information access and features related to that information. The first step of model modification is to estimate the degree of relevance between the input feature vector assigned to the information accessed by the user and the model of the user's interests acquired from previous data. The second step is to adjust the weights of features in the user model.
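The click-count behavior described under interactive customization (first click selects the word, each further click the next larger enclosing element) can be sketched as a walk up the element tree. This is a hypothetical illustration rather than the browser's code: attribute values are quoted for a generic XML parser, the word is located by a simple text match, and the names SENTENCE and select_by_clicks are assumptions.

```python
import xml.etree.ElementTree as ET

SENTENCE = (
    '<su><np sem="time0">time</np> '
    '<vp><v sem="fly1">flies</v> '
    '<adp><ad sem="like0">like</ad> <np>an '
    '<n sem="arrow0">arrow</n></np></adp></vp></su>'
)

def select_by_clicks(root, word, clicks):
    """Return the text selected after clicking `clicks` times on `word`."""
    # ElementTree has no parent pointers, so build a child-to-parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    # Start from the innermost element whose own text contains the word.
    element = next(el for el in root.iter()
                   if el.text and word in el.text.split())
    for _ in range(clicks - 1):
        if element not in parent_of:  # already at the outermost element
            break
        element = parent_of[element]
    return "".join(element.itertext()).strip()

root = ET.fromstring(SENTENCE)
for clicks in (1, 2, 3):
    print(clicks, select_by_clicks(root, "arrow", clicks))
```

Repeated clicks on "arrow" thus grow the selection from the noun to the noun phrase and then to the adverbial phrase; the selected element could then be given extra input in spreading activation.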
5 Concluding Remarks
We have discussed the GDA project, which aims at supporting versatile and intelligent contents. Our focus in the present paper is one of its applications to automatic text summarization. We are evaluating our summarization method using online Japanese articles with GDA tags. We are also extending text summarization to that of hypertext. For example, a summary of a hypertext document will recursively embed linked documents in the summary, which should also be useful for encyclopedic entries. Future work includes construction of a large-scale GDA corpus and system evaluation by open experimentation. GDA tools including a tagging editor and a browser will soon be publicly available on the WWW. Our main current concern is interactive and intelligent presentation, as an extension of text summarization. This may turn out to be a killer application of GDA because it not only presupposes a rather small amount of tagged documents but also makes the effect of tagging immediately visible to the author. We hope that our project will revolutionize global and intercultural communications.
References
Kôiti Hasida, Syun Ishizaki, and Hitoshi Isahara. 1987. A connectionist approach to the generation of abstracts. In Gerard Kempen, editor, Natural Language Generation: New Results in Artificial Intelligence, Psychology and Linguistics, pages 149-156. Martinus Nijhoff.
Eduard Hovy and Chin Yew Lin. 1997. Automated text summarization in SUMMARIST. In Proceedings of ACL Workshop on Intelligent Scalable Text Summarization.
George Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39-41.
Hideo Watanabe. 1996. A method for abstracting newspaper articles by using surface clues. In Proceedings of the Sixteenth International Conference on Computational Linguistics (COLING-