VnLoc: A Real–time News Event Extraction Framework for Vietnamese
Mai-Vu Tran∗ vutm@vnu.edu.vn
Minh-Hoang Nguyen∗ hoangnm_53@vnu.edu.vn
Sy-Quan Nguyen∗ quanns_53@vnu.edu.vn
Minh-Tien Nguyen∗∗ tiennm@utehy.edu.vn
Xuan-Hieu Phan∗ hieupx@vnu.edu.vn
∗KTLab, Faculty of Information Technology, College of Technology
Vietnam National University, Hanoi (VNU)
Hanoi, Vietnam
∗∗ Faculty of Information Technology, Hung Yen University of Technology and Education, Hungyen, Vietnam
Abstract
Event extraction is a complex and interesting topic in
Information Extraction that covers methods for extracting events
from free text or web data. The results of event extraction
systems can be used in several fields, such as risk analysis
systems, online monitoring systems and decision support tools
[4]. In this paper, we introduce a method that combines
lexico–semantic rules and machine learning to extract events from
Vietnamese news. We then describe VnLoc, an online event
monitoring system built on this method to extract events in the
Vietnamese language. In the experimental phase, we evaluated
the method using precision, recall and the F1 measure. At the
time of the experiment, we investigated only three types of
event: FIRE, CRIME and TRANSPORT ACCIDENT.
1 Introduction
The information explosion and the development of Information
and Communication Technology make it easy for people to reach
information. As a result, information is becoming richer and
more diversified. Information from different sources
(newspapers, blogs, social networks, ...) is the main cause of
information chaos. Thus, extracting the useful information that
readers are interested in from the daily news is really
necessary, so that readers can know the details of a news item
without reading its entire content. In addition, the results of
event extraction can be used in online monitoring systems where
users can catch information easily.
Recently, event extraction has received more attention from scientists in Natural Language Processing and Data Mining around the world. In 1987, event extraction became a main topic of the Message Understanding Conference (MUC) [5]. In this conference, an event was defined as follows: an event must have an actor, a time, a place and an impact on the surrounding environment. In addition, the Automatic Content Extraction program defined an event as an activity created by participants and divided events into eight types: Life, Movement, Transaction, Business, Conflict, Contact, Personnel and Justice. According to Allan’s definition, an event includes four attributes: modality, polarity (Positive, Negative), genericity (Specific, Generic) and tense (Past, Present, Future, Unspecified) [1].
Based on an investigation and analysis of the meaning of event extraction, we propose an event extraction method for the Vietnamese language and build an online event monitoring system named VnLoc. The proposed method is a combination of lexico–semantic rules and machine learning. The system's data are gathered from news through RSS feeds. Then, we apply the proposed method to classify each news item into one of two categories, EVENT or NON-EVENT, based on its title. After that, we extract event attributes from the items classified as events, and the final result is visualized on an online map.
2012 Fourth International Conference on Knowledge and Systems Engineering
The rest of this paper is organized as follows. Section 2
reviews related work, whereas Section 3 describes our method and
the online event monitoring system VnLoc in more detail.
Section 4 illustrates our experiment and evaluates the results
on real data. The last section is the conclusion.
2 Related Work
In [7], Ralph Grishman et al. investigated Maximum Entropy
for event detection. They used three classifiers for individual
tasks: an argument classifier, a role classifier and an event
classifier. Moreover, event coreference is resolved by another
Maximum Entropy classifier using features such as the event
type, the event subtype, the anaphor anchor and the distance
between anaphor and anchor. In another study, Heng Ji and Ralph
Grishman explored Maximum Entropy to identify events of a
separate type [6]. It is a sentence–level classifier that
processes each sentence in the document and attempts to
determine its event type.
Lexico–semantic patterns can be used for various purposes in
many domains. Cohen, Verspoor et al. [3] applied semantic rules
as patterns to extract events in the biological domain. They
divided biological events into six types: binding, gene
expression, localization, phosphorylation, protein catabolism
and transcription. Biological events are extracted through
patterns, each of which is a set of semantic words. In other
work, Jethro Borsje et al. [2] proposed using lexico–semantic
patterns to detect financial events from RSS news feeds. These
patterns were organized in a financial ontology expressed in
OWL. Each pattern has a triple format with three elements: a
subject, a relation and an optional object.
In addition, there are several systems that extract events from
online news in other domains. Collier et al. built the BioCaster
system, where several event types can be followed around the
world (http://born.nii.ac.jp). Besides, the HealthMap system was
built by Freifeld and Brownstein, where users can monitor
disease types around the world (http://www.healthmap.org).
Finally, the Frontex system was developed by Atkinson,
Piskorski et al. (http://frontex.europa.eu) for monitoring on
behalf of the European agency.
3 Implemented System
3.1 System Architecture
VnLoc is an event monitoring system that is horizontally
scalable and distributed. Its architecture is illustrated in
figure 1. VnLoc consists of six components: a scalable data
repository, a news crawler, an event detector, an event
extractor, a plugin engine and a web–based visualizer.

Table 1: Features Description
Feature                   χ2 weight    Frequency   Semantic label
chém (cutting)            0.70329136   240         CRIME
giết (kill)               0.6890592    530         CRIME
cháy (fire)               0.5872597    201         FIRE
gây tai nạn (crashed)     0.5106312    374         TRANSPORT ACCIDENT

We organize the data repository into three parts: a news
database, an event database and a data corpus. The first one
stores detailed information about the news that is gathered. The
second one stores information about the events extracted by the
event extractor. Both are stored in MongoDB to attain a key
property: high scalability. The last one comprises several
corpora that support both the machine learning process and the
extraction process.

Next, the news crawler fetches news through RSS resources
supplied by many websites such as VnExpress¹, VietNamNet² and
DanTri³. Thanks to the structure of the RSS format, useful
information can be extracted from each feed by an XML parser and
saved in the news database of the data repository. Furthermore,
the visualizer is a map that shows events on a web interface.
Data is pulled from the event database and pushed, with some
modifications, to the Google Map API for display. In the
following, the two main elements, the event detector and the
event extractor, are explained to make the VnLoc system clearer.
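As an illustration of the crawler step described above, the following is a minimal sketch of how one RSS item could be parsed into the news elements stored in the news database. It is not the VnLoc implementation: the sample feed, the function name parse_items and the dictionary keys are assumptions made for this example, and a real crawler would download the feed over HTTP before parsing it.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 snippet; real feeds differ in detail.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Cháy lớn tại quận Hoàn Kiếm</title>
      <description>Short summary of the article.</description>
      <pubDate>Fri, 13 Apr 2012 08:30:00 +0700</pubDate>
      <link>http://example.com/news/1</link>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return one dict per <item>, with title, abstract, publish time and link."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        items.append({
            "title": item.findtext("title", default="").strip(),
            "abstract": item.findtext("description", default="").strip(),
            # RSS 2.0 dates use the RFC 822 format, which email.utils can parse.
            "publish_time": parsedate_to_datetime(pub) if pub else None,
            "link": item.findtext("link", default="").strip(),
        })
    return items


news = parse_items(SAMPLE_RSS)
print(news[0]["title"], news[0]["publish_time"])
```

The publish time is kept as a datetime object because the time extraction rules later use it as the reference point for relative expressions.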
3.2 Event Detector
When a news item is gathered, the event detector determines whether it contains an event. To settle this task, we use a binary classification approach based on the Maximum Entropy method. We examined the domain carefully and found that most news titles express their content clearly. Therefore, our problem is sentence–level classification. First, a set of features is chosen based on the χ2 weight computed on offline data gathered beforehand. At the same time, the N–gram method is used to select phrases as features. In this paper, we choose uni–grams, bi–grams and tri–grams as the three phrase types. Moreover, each feature is tagged with a semantic label to enhance its meaning. Table 1 shows some examples. After that, the Maximum Entropy classifier is applied to divide the set of titles into two categories: EVENT and NON-EVENT. This step is a precondition for the event extractor in the next phase.
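The feature selection step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: titles are tokenized by whitespace (the paper does not say how Vietnamese words are segmented), the score is the standard chi-square statistic of a 2x2 term/class table, and the toy titles are invented. The weights in Table 1 lie between 0 and 1, so the paper presumably normalizes the statistic in some way that is not described; no normalization is attempted here.

```python
from collections import Counter


def ngrams(tokens, max_n=3):
    """Return all uni-, bi- and tri-grams of a token list as strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 term/class contingency table.

    n11: EVENT titles containing the term
    n10: NON-EVENT titles containing the term
    n01: EVENT titles without the term
    n00: NON-EVENT titles without the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def score_features(titles, labels, max_n=3):
    """Score every n-gram by its chi-square association with the EVENT class."""
    event_df, other_df = Counter(), Counter()
    n_event = sum(1 for y in labels if y == "EVENT")
    n_other = len(labels) - n_event
    for title, y in zip(titles, labels):
        # A set, so that each n-gram is counted once per title.
        grams = set(ngrams(title.lower().split(), max_n))
        (event_df if y == "EVENT" else other_df).update(grams)
    scores = {}
    for gram in set(event_df) | set(other_df):
        n11, n10 = event_df[gram], other_df[gram]
        scores[gram] = chi_square(n11, n10, n_event - n11, n_other - n10)
    return scores


# Invented toy data, for illustration only.
titles = ["cháy lớn tại chợ", "cháy xe trên phố",
          "giá vàng tăng", "đội tuyển thắng lớn"]
labels = ["EVENT", "EVENT", "NON-EVENT", "NON-EVENT"]
scores = score_features(titles, labels)
top = sorted(scores, key=scores.get, reverse=True)[:5]
print(top)
```

The highest-scoring n-grams would then serve as features of the Maximum Entropy classifier; an n-gram that occurs equally often in both classes scores zero and would be discarded.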
3.3 Event Extractor
¹ www.vnexpress.net
² www.vietnamnet.vn
³ www.dantri.com

Figure 1: VnLoc’s Architecture

In the second important part, the event and its information,
such as time, place and participants, are extracted from the
news items that the event detector predicts to contain an event.
Our approach is clear and knowledge driven: many rules are
generated and applied to the news items passed from the previous
phase. In this paper, we use 7 types of rules for this purpose.
To extract the event itself, rules 1, 2, 3 and 4 are applied:
• FIRE
<PRE> <FIRE> <POST>    (1)
• CRIME
<PRE> <CRIME> <POST>    (2)
• TRANSPORT ACCIDENT
<PRE> <ACCIDENT> <POST>    (3)
<PRE> <DAMAGE> <POST>    (4)
Here PRE and POST are the phrases or words surrounding the
keywords, and the keyword sets are defined as follows:
FIRE := vụ cháy (fire)
        bùng cháy (burning)
        cháy rụi (burned)
        (18 rules)
CRIME := ẩu đả (brawl)
         băng cướp (bandits)
         bị đâm chết (stabbed)
         (90 rules)
ACCIDENT := đâm xe (car crash)
            cán chết (crashed)
            lật tàu (boat capsized)
            (27 rules)
DAMAGE := thiệt mạng (die)
          chết thảm (pitiful death)
          chấn thương sọ não (brain injury)
          (22 rules)
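Rules 1 to 4 can be sketched with regular expressions as below. This is only an illustration under assumptions: the keyword lists contain just the three examples given for each set (the full sets of 18, 90, 27 and 22 rules are not listed in the paper), matching is done on plain substrings without word segmentation, and the names KEYWORDS, EVENT_OF_SLOT and match_event are hypothetical.

```python
import re
import unicodedata


def nfc(text):
    """Normalize to NFC so that composed and decomposed diacritics compare equal."""
    return unicodedata.normalize("NFC", text)


# Only the example keywords quoted in the text; the real sets are larger.
KEYWORDS = {
    "FIRE": ["vụ cháy", "bùng cháy", "cháy rụi"],
    "CRIME": ["ẩu đả", "băng cướp", "bị đâm chết"],
    "ACCIDENT": ["đâm xe", "cán chết", "lật tàu"],
    "DAMAGE": ["thiệt mạng", "chết thảm", "chấn thương sọ não"],
}

# Rules 3 and 4 both belong to the TRANSPORT ACCIDENT event type.
EVENT_OF_SLOT = {
    "FIRE": "FIRE",
    "CRIME": "CRIME",
    "ACCIDENT": "TRANSPORT ACCIDENT",
    "DAMAGE": "TRANSPORT ACCIDENT",
}


def compile_rule(slot):
    """Build the pattern <PRE> <slot keyword> <POST> for one keyword set."""
    words = sorted((nfc(k) for k in KEYWORDS[slot]), key=len, reverse=True)
    alternatives = "|".join(re.escape(w) for w in words)
    return re.compile(
        r"(?P<pre>.*?)(?P<key>%s)(?P<post>.*)" % alternatives,
        re.IGNORECASE | re.DOTALL,
    )


RULES = [(slot, compile_rule(slot))
         for slot in ("FIRE", "CRIME", "ACCIDENT", "DAMAGE")]


def match_event(text):
    """Apply rules 1-4 in order and return the first match, or None."""
    text = nfc(text)
    for slot, pattern in RULES:
        m = pattern.match(text)
        if m:
            return {
                "type": EVENT_OF_SLOT[slot],
                "pre": m.group("pre").strip(),
                "keyword": m.group("key"),
                "post": m.group("post").strip(),
            }
    return None


print(match_event("Vụ cháy lớn tại chợ"))
print(match_event("Giá vàng tăng"))
```

Trying the rules in a fixed order is a design choice of this sketch; the paper does not say how a sentence that matches several keyword sets is resolved.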
To extract the time when an event happened, we use two methods: direct and indirect. The former applies when the time is stated completely in the news content; in that case we use regular expressions. The latter applies when the time is not concrete. For instance,
"Hôm nay, hai vụ tai nạn giao thông đã xảy ra trên đường Khuất Duy Tiến." ("Today, two transport accidents happened on Khuat Duy Tien Street.")
In this example, Hôm nay (Today) is a relative adverb that does
not denote exactly when the event occurred. We solve this
problem by matching against a dictionary of relative words and
their bias values, defined below. Then, rule 5 is used to
extract the time.
RELATIVETIME = (RELATIVE_WORD, BIAS)
             = {(hôm nay (today), 0), (rạng sáng nay (this morning), 0),
                (hôm qua (yesterday), −1), (hai ngày trước (two days ago), −2)}

<DATE> = <PUBLISH_TIME> + <BIAS>    (5)

Next, we extract the location where the event occurred. As
stated in rule 6, we use two constituents to find the proper
location. The first is LOCPREP, a set of prepositions that come
before the place name; the second is LOCPREFIX, a set of
prefixes that come after those prepositions but before the
place name. We then apply rule 6 to perform this task.
LOCPREP = {ở (in), tại (at), trên (on), gần (nearby, near), trong (into)}
LOCPREFIX = {thành phố (city), tỉnh (province), quận (district),
             thị xã (town), xã (village), phố (street)}
LOCATION = {loc_i | loc_i ∈ location dictionary}

<LOCPREP> <LOCPREFIX> <LOCATION>    (6)

Finally, participants are considered. As with the previous event
attributes, we use a rule, shown in rule 7.
<PRE> <PERSON>    (7)
PRE := ông (Mr)
       bà (Mrs/Ms)
       gia đình (family)
       nghi can (suspect)
       bị cáo (defendant)
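Rules 5 to 7 above can be sketched as follows. This is a hedged illustration, not the VnLoc implementation: the direct date pattern (dd/mm/yyyy), the three-entry LOCATION_DICTIONARY and the heuristic that a person name is a run of capitalized tokens are assumptions of this example, and the function names are hypothetical. The dictionaries RELATIVE_TIME, LOCPREP, LOCPREFIX and PERSON_PRE contain only the entries listed in the text.

```python
import re
from datetime import date, timedelta

RELATIVE_TIME = {"hôm nay": 0, "rạng sáng nay": 0,
                 "hôm qua": -1, "hai ngày trước": -2}
LOCPREP = ["ở", "tại", "trên", "gần", "trong"]
LOCPREFIX = ["thành phố", "tỉnh", "quận", "thị xã", "xã", "phố"]
LOCATION_DICTIONARY = ["Hà Nội", "Cầu Giấy", "Hoàn Kiếm"]  # hypothetical entries
PERSON_PRE = ["ông", "bà", "gia đình", "nghi can", "bị cáo"]


def alternation(words):
    """Regex alternation, longest first so 'thị xã' is tried before 'xã'."""
    return "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))


def resolve_date(text, publish_date):
    """Direct method: explicit dd/mm/yyyy. Indirect method: rule 5."""
    m = re.search(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b", text)
    if m:
        day, month, year = map(int, m.groups())
        return date(year, month, day)
    lowered = text.lower()
    for word, bias in RELATIVE_TIME.items():
        if word in lowered:
            return publish_date + timedelta(days=bias)  # DATE = PUBLISH_TIME + BIAS
    return None


LOCATION_RULE = re.compile(
    r"(?:^|\s)(?:%s)\s+(?P<prefix>%s)\s+(?P<loc>%s)"
    % (alternation(LOCPREP), alternation(LOCPREFIX),
       alternation(LOCATION_DICTIONARY)),
    re.IGNORECASE,
)


def extract_location(text):
    """Rule 6: <LOCPREP> <LOCPREFIX> <LOCATION>."""
    m = LOCATION_RULE.search(text)
    return "%s %s" % (m.group("prefix"), m.group("loc")) if m else None


def extract_participants(text):
    """Rule 7: <PRE> <PERSON>, with PERSON taken as a run of capitalized tokens."""
    tokens = text.split()
    found = []
    for pre in PERSON_PRE:
        pre_tokens = pre.split()
        k = len(pre_tokens)
        for i in range(len(tokens) - k):
            if [t.lower() for t in tokens[i:i + k]] != pre_tokens:
                continue
            name = []
            for t in tokens[i + k:]:
                word = t.rstrip(".,;:!?")
                if not word or not word[0].isupper():
                    break
                name.append(word)
                if word != t:  # trailing punctuation ends the name
                    break
            if name:
                found.append(pre + " " + " ".join(name))
    return found


sentence = "Hôm qua, công an bắt giữ nghi can Nguyễn Văn A tại quận Cầu Giấy."
print(resolve_date(sentence, date(2012, 4, 13)))
print(extract_location(sentence))
print(extract_participants(sentence))
```

Note that the example sentence in the text uses "đường" (road) before the place name, which is not in the LOCPREFIX list as given, so a strict reading of rule 6 would need that prefix added to match it.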
4 Experiment and Result
Our experiment was conducted on a data set of 18,400 titles extracted from 3,842,137 news titles of BAOMOI⁴ gathered through RSS feeds. The components of a news item are listed in table 2. We evaluated event detection through the event classification process using 10–fold cross validation. The data set is split into 10 parts; in each round, 9 parts are used as training data and 1 part as testing data (a 9:1 ratio). The classification results are shown in table 3 and figure 2.

Table 2: News’ elements
Title          News headline
Abstract       Short paragraph that summarizes the news content
Publish time   Time when the news is published; may support the
               time extraction process
Link           Link to the original news
Table 3: Result of classification
          Precision   Recall   F1
Fold 1    92.70       89.23    90.93
Fold 2    93.08       91.39    92.23
Fold 3    93.32       91.54    92.42
Fold 4    93.32       91.54    92.42
Fold 5    93.68       91.78    92.72
Fold 6    93.50       91.60    92.54
Fold 7    92.95       90.81    91.87
Fold 8    92.39       89.01    90.67
Fold 9    91.81       88.65    90.20
Fold 10   91.68       88.51    90.07
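The evaluation protocol can be sketched as below: a 10-fold split in which every title is tested exactly once, and the usual precision, recall and F1 computation for the EVENT class. The function names and the interleaved fold assignment are assumptions of this sketch; the paper does not describe how the folds were drawn. The last lines check that the F1 column of Table 3 is the harmonic mean of the precision and recall columns, using the Fold 1 row.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists; each item is in the test set exactly once."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


def precision_recall_f1(gold, predicted, positive="EVENT"):
    """Precision, recall and F1 of the positive class, as fractions in [0, 1]."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# With 18,400 titles, each round trains on 16,560 titles and tests on 1,840.
train, test = next(k_fold_indices(18400, k=10))
print(len(train), len(test))

# Fold 1 of Table 3: precision 92.70 and recall 89.23 give F1 90.93.
print(round(f1_score(92.70, 89.23), 2))
```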
After the system went online at http://vnloc.com (figure 3), we evaluated the event extraction results manually by checking each event shown on the system from 13/04/2012 to 22/04/2012. The precision statistics of event extraction are shown in table 4. Based on the articles detected as containing events, the statistics in table 4 indicate that the event extraction strategy combining lexico–semantic rules and machine learning is appropriate for Vietnamese news. In some cases, event extraction fails because of ambiguous place names: many locations have similar names, while the article does not state the location fully.
⁴ http://www.baomoi.com/
Figure 3: VnLoc at http://vnloc.com
Figure 2: Result of classification
Table 4: Event extraction result (in quantity)
Date   Extracted   Correct   Precision(%)

5 Conclusion
In this paper, we have presented a method that combines
lexico–semantic rules and machine learning (Maximum Entropy)
for event extraction on Vietnamese data, and described the
VnLoc system.
The experimental results demonstrate that combining
lexico–semantic rules and machine learning achieves good results
on Vietnamese data. The Maximum Entropy method is used for
binary classification and keeps only the events that fit the
features in the training data set. Lexico–semantic rules are
applied to extract useful information about each event. Finally,
the event’s attributes are extracted based on rules and
visualized on an online map.
Furthermore, we have described the system architecture in detail. In particular, we concentrated on the Event Detector component, which uses the proposed method to recognize an event, and the Event Extractor, which uses lexico–semantic rules to extract the event’s attributes. Although we have achieved good results, the system needs some improvements to enhance its quality in the future. Firstly, the precision of the Maximum Entropy classifier must be enhanced by adding useful information. Secondly, we aim to expand to further areas such as disasters (disease, earthquake, tsunami), culture and finance. To this end, an ontology is being built to make it easier to integrate plugin modules for the purposes above.
This research work was partly supported by the National Major Research Program KC.01/11-15 (code KC.01.TN04/11-15) under the project "Analyzing opinion’s trend based on social network and its application in tourism and technology products".
References
[1] J. Allan, R. Papka, and V. Lavrenko. On-line new event detection and tracking. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’98, pages 37–45, New York, NY, USA, 1998. ACM.
[2] J. Borsje, F. Hogenboom, and F. Frasincar. Semi–automatic financial events discovery based on lexico–semantic patterns. Int. J. Web Eng. Technol., 6(2):115–140, Jan. 2010.
[3] K. B. Cohen, K. Verspoor, H. L. Johnson, C. Roeder, P. V. Ogren, W. A. Baumgartner, Jr., E. White, H. Tipney, and L. Hunter. High-precision biological event extraction with a concept recognizer. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task, BioNLP ’09, pages 50–58, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.
[4] F. Hogenboom, F. Frasincar, U. Kaymak, and F. de Jong. An overview of event extraction from text. In Workshop on Detection, Representation, and Exploitation of Events in the Semantic Web, 2011.
[5] R. Grishman and B. Sundheim. Message understanding conference-6: a brief history. In Proceedings of the 16th Conference on Computational Linguistics - Volume 1, COLING ’96, pages 466–471, Stroudsburg, PA, USA, 1996. Association for Computational Linguistics.
[6] H. Ji and R. Grishman. Refining event extraction through cross-document inference. In Proc., 2008.
[7] R. Grishman, D. Westbrook, and A. Meyers. NYU’s English ACE 2005 system description. ACE Program, 2005.