DSpace at VNU: VnLoc: A real-time news event extraction framework for Vietnamese

VnLoc: A Real–time News Event Extraction Framework for Vietnamese

Mai-Vu Tran vutm@vnu.edu.vn

Minh-Hoang Nguyen hoangnm_53@vnu.edu.vn

Sy-Quan Nguyen quanns_53@vnu.edu.vn

Minh-Tien Nguyen∗∗ tiennm@utehy.edu.vn

Xuan-Hieu Phan hieupx@vnu.edu.vn

KTLab, Faculty of Information Technology, College of Technology

Vietnam National University, Hanoi (VNU)

Hanoi, Vietnam

∗∗ Faculty of Information Technology, Hung Yen University of Technology and Education, Hungyen, Vietnam

Abstract

Event extraction is a complex and interesting topic in Information Extraction that covers methods for extracting events from free text or web data. The results of event extraction systems can be used in several fields, such as risk analysis systems, online monitoring systems and decision support tools [4]. In this paper, we introduce a method that combines lexico–semantic rules and machine learning to extract events from Vietnamese news. Furthermore, we describe an online event monitoring system named VnLoc, which is built on this method to extract events in the Vietnamese language. In the experiment phase, we evaluated the method using precision, recall and F1 measure. At the time of the experiment, we investigated only three types of event: FIRE, CRIME and TRANSPORT ACCIDENT.

1 Introduction

The information explosion and the development of Information and Communication Technology create good conditions for people to reach information easily. As a result, information is becoming richer and more diversified. Information from different sources (newspapers, blogs, social networks, ...) is the main cause of information chaos. Thus, extracting the useful information that readers are interested in from the daily news is really [...] know the details of a news item without reading its entire content. In addition, the results of event extraction can be used in online monitoring systems, where users can catch information easily.

Recently, event extraction has received more attention from scientists in Natural Language Processing and Data Mining around the world. In 1987, event extraction became a main topic of the Message Understanding Conference (MUC) [5]. At this conference, an event was defined as follows: an event must have an actor, a time, a place and an impact on the surrounding environment. In addition, the Automatic Content Extraction (ACE) program defined an event as an activity created by participants, and divided events into eight types: Life, Movement, Transaction, Business, Conflict, Contact, Personnel and Justice. According to Allan's definition, an event includes four attributes: modality, polarity (Positive, Negative), genericity (Specific, Generic) and tense (Past, Present, Future, Unspecified) [1].

Based on our investigation and analysis of event extraction, we propose an event extraction method for the Vietnamese language and build an online event monitoring system named VnLoc. The proposed method combines lexico–semantic rules and machine learning. The system's data are gathered from news through RSS feeds. We then apply the proposed method to classify each news item as EVENT or NON-EVENT based on its title. After that, we extract event attributes from the items classified as events, and the last result [...]

2012 Fourth International Conference on Knowledge and Systems Engineering

[...] whereas section 3 describes our method and the online event monitoring system VnLoc in more detail. Section 4 illustrates our experiment and evaluates the results on real data. The last section is the conclusion.

2 Related Work

In [7], Ralph Grishman et al. investigated Maximum Entropy for event detection. They used three classifiers for the individual tasks: an argument classifier, a role classifier and an event classifier. Moreover, event coreference is resolved by another Maximum Entropy classifier using features such as the event type, the event subtype, the anaphor anchor and the distance between anaphor and anchor. In another study, Heng Ji and Ralph Grishman used Maximum Entropy to identify events of a separate type [6]. It is a sentence–level classifier that processes each sentence in the document and attempts to determine its event type.

Lexico–semantic patterns can be used for various purposes in many domains. Cohen and Verspoor et al. [3] applied semantic rules as patterns to extract events in the biological domain. They divided biological events into six types: binding, gene expression, localization, phosphorylation, protein catabolism and transcription. Biological events are extracted through patterns, each of which is a set of semantic words. In other work, Jethro Borsje et al. [2] proposed using lexico–semantic patterns to detect financial events from RSS news feeds. These patterns were organized in a financial OWL ontology. Each pattern has a triple format consisting of three elements: a subject, a relation and an optional object.

In addition, there are several systems that extract events from online news in other domains. Collier et al. built the BioCaster system, where users can follow several event types around the world (http://born.nii.ac.jp). The HealthMap system was built by Freifeld and Brownstein, where users can monitor disease types around the world (http://www.healthmap.org). Likewise, the Frontex system was developed by Atkinson and Piskorski et al. (http://frontex.europa.eu) to support monitoring for the European agency.

3 Implemented System

3.1 System Architecture

VnLoc is an event monitoring system that is horizontally scalable and distributed. Its architecture is illustrated in figure 1. VnLoc consists of six components: a scalable data repository, a news crawler, an event detector, an event extractor, a plugin engine and a web–based visualizer.

Table 1: Feature description

Feature                  χ² weight     Frequency   Semantic label
chém (cutting)           0.70329136    240         CRIME
giết (kill)              0.6890592     530         CRIME
cháy (fire)              0.5872597     201         FIRE
gây tai nạn (crashed)    0.5106312     374         TRANSPORT ACCIDENT

We organize the data repository into three parts: a news database, an event database and a data corpus. The first stores the detailed information of the gathered news. The second stores the event information extracted by the event extractor. Both are organized in MongoDB to attain an important goal: high scalability. The last comprises several corpora that support both the machine learning process and the extraction process. Next, the news crawler fetches news through RSS resources supplied by many websites such as VnExpress¹, VietNamNet² and DanTri³. Thanks to the structure of the RSS format, useful information can be extracted from each feed by an XML parser and saved in the news database of the data repository. Furthermore, the visualizer is a map that shows events on a web interface. Data is pulled from the event database, pushed to the Google Map API with some modifications, and displayed. In the following, the two main components, the event detector and the event extractor, are explained to make the VnLoc system clearer.
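The crawling step described above can be sketched with Python's standard XML parser. This is only a minimal illustration, not VnLoc's actual crawler: the sample feed, the `parse_rss` function and the record field names are assumptions chosen to mirror the news elements listed in Table 2, and a real crawler would download the feed and store the records in the news database.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 snippet standing in for a downloaded feed.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Cháy lớn tại chợ</title>
      <description>Tóm tắt bản tin.</description>
      <pubDate>Fri, 13 Apr 2012 08:30:00 +0700</pubDate>
      <link>http://example.com/news/1</link>
    </item>
  </channel>
</rss>"""


def parse_rss(xml_text):
    """Return one record per <item> with title, abstract, publish time and link."""
    root = ET.fromstring(xml_text)
    records = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        records.append({
            "title": (item.findtext("title") or "").strip(),
            "abstract": (item.findtext("description") or "").strip(),
            # RSS 2.0 dates use the RFC 822 format, which email.utils can parse.
            "publish_time": parsedate_to_datetime(pub) if pub else None,
            "link": (item.findtext("link") or "").strip(),
        })
    return records


records = parse_rss(SAMPLE_RSS)
```

Each record keeps the publish time as a `datetime`, which is the value the time extraction step later relies on.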

3.2 Event Detector

When a news item is gathered, the event detector decides whether it contains an event. To settle this task, we use a binary classification approach, the Maximum Entropy method. We examined the domain carefully and found that most news titles express their content clearly. Therefore, our problem is a sentence–level classification. First, the set of features is chosen based on the χ² weight computed on offline data gathered beforehand. At the same time, the N–gram method is used to select phrases as features. In this paper, we choose uni–grams, bi–grams and tri–grams as the three phrase types. Moreover, each feature is tagged with a semantic label to enhance its meaning. Table 1 shows some examples. After that, the Maximum Entropy classifier is applied to divide the set of titles into two categories: EVENT and NON-EVENT. This step is a precondition for the event extractor in the next phase.
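The feature selection step can be illustrated as follows. This is a sketch under stated assumptions, not the paper's implementation: titles are split on whitespace (the paper does not say how Vietnamese text is tokenized), the score is the standard χ² statistic of a 2×2 contingency table (the weights in Table 1 lie between 0 and 1, so the paper presumably applies a normalization that it does not specify), and the function names and toy titles are hypothetical.

```python
from collections import Counter


def ngrams(title, max_n=3):
    """Return the set of uni-, bi- and tri-grams of a whitespace-tokenized title."""
    tokens = title.lower().split()
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def chi_square(titles, labels):
    """Score each n-gram against the binary EVENT label (1 = EVENT, 0 = NON-EVENT)."""
    n = len(titles)
    n_event = sum(labels)
    with_feature = Counter()        # number of titles containing the n-gram
    with_feature_event = Counter()  # ... of which are labeled EVENT
    for title, is_event in zip(titles, labels):
        for gram in ngrams(title):
            with_feature[gram] += 1
            if is_event:
                with_feature_event[gram] += 1

    scores = {}
    for gram, total in with_feature.items():
        a = with_feature_event[gram]  # present, EVENT
        b = total - a                 # present, NON-EVENT
        c = n_event - a               # absent, EVENT
        d = n - n_event - b           # absent, NON-EVENT
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[gram] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


# Toy data, invented for illustration only.
titles = ["cháy lớn tại chợ", "cháy nhà dân", "giá vàng tăng", "đội tuyển thắng lớn"]
labels = [1, 1, 0, 0]
scores = chi_square(titles, labels)
```

In this toy set, "cháy" occurs only in EVENT titles and receives the highest score, while "lớn" occurs equally in both classes and scores zero. The top-scoring n-grams would then serve as features for the Maximum Entropy classifier.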

3.3 Event Extractor

In the second important part, the event and its information, such as time, place and participants, are extracted from news

¹ www.vnexpress.net

² www.vietnamnet.vn

³ www.dantri.com

Figure 1: VnLoc’s Architecture

that the event detector predicts to contain an event. Our approach is clear and knowledge-driven: a set of rules is generated and applied to the news items passed on from the previous phase. In this paper, we use 7 types of rules for this purpose.

To extract the event, rules 1, 2, 3 and 4 are applied:

• FIRE

<PRE> <FIRE> <POST> (1)

• CRIME

<PRE> <CRIME> <POST> (2)

• TRANSPORT ACCIDENT

<PRE> <ACCIDENT> <POST> (3)

<PRE> <DAMAGE> <POST> (4)

Here PRE and POST are the phrases or words surrounding the keywords.

FIRE := vụ cháy (fire)

bùng cháy (burning)

cháy rụi (burned)

(18 rules)

CRIME := ẩu đả (brawl)

băng cướp (bandits)

bị đâm chết (stabbed)

(90 rules)

ACCIDENT := đâm xe (car crash)

cán chết (crashed)

lật tàu (boat capsized)

(27 rules)

DAMAGE := thiệt mạng (die)

chết thảm (pitiful death)

chấn thương sọ não (brain injury)

(22 rules)
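Rules 1 to 4 amount to locating a keyword and keeping the text on either side of it. The sketch below shows one possible reading, assuming case-insensitive substring matching and using only the sample keywords printed above (the full lists of 18, 90, 27 and 22 rules are not given). The names `KEYWORDS`, `EVENT_TYPE` and `match_event` are hypothetical.

```python
import re

# Only the sample keywords listed in the paper; the complete rule sets are larger.
KEYWORDS = {
    "FIRE": ["vụ cháy", "bùng cháy", "cháy rụi"],
    "CRIME": ["ẩu đả", "băng cướp", "bị đâm chết"],
    "ACCIDENT": ["đâm xe", "cán chết", "lật tàu"],
    "DAMAGE": ["thiệt mạng", "chết thảm", "chấn thương sọ não"],
}

# Rules 3 and 4 both belong to the TRANSPORT ACCIDENT event type.
EVENT_TYPE = {
    "FIRE": "FIRE",
    "CRIME": "CRIME",
    "ACCIDENT": "TRANSPORT ACCIDENT",
    "DAMAGE": "TRANSPORT ACCIDENT",
}


def match_event(text):
    """Return (event_type, pre, keyword, post) for the first matching rule, else None."""
    for group, words in KEYWORDS.items():
        for word in words:
            m = re.search(re.escape(word), text, re.IGNORECASE)
            if m:
                pre = text[:m.start()].strip()
                post = text[m.end():].strip()
                return EVENT_TYPE[group], pre, m.group(0), post
    return None


fire = match_event("Vụ cháy lớn tại chợ Đồng Xuân")
accident = match_event("Xe tải lật, hai người thiệt mạng")
nothing = match_event("Giá vàng tăng")
```

Groups are tried in a fixed order here, so a text containing keywords of several types is assigned to the first one found; how the original system resolves such conflicts is not stated.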

To extract the time when an event happened, we use two methods: direct and indirect. The former applies when the time is stated completely in the news content; we use regular expressions to accomplish this task. The latter applies when the time is not concrete. For instance, "Hôm nay, hai vụ tai nạn giao thông đã xảy ra trên đường Khuất Duy Tiến." ("Today, two transport accidents happened on Khuat Duy Tien Street."). In this example, Hôm nay (Today) is a relative adverb that does not denote exactly when the event occurred. We solve this problem by matching against a dictionary that pairs each relative time expression with a relative value, as defined below. Then, rule 5 is used to extract the time.
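Both cases can be sketched in a few lines. The dictionary entries are the four that the paper lists in its RELATIVETIME definition, and the indirect case follows rule 5 (date = publish time + bias). Everything else is an assumption made for illustration: the dd/mm/yyyy pattern for the direct case (the paper does not give its regular expressions) and the names `ABSOLUTE_DATE` and `extract_date`.

```python
import re
from datetime import date, timedelta

# Relative expressions and their offsets in days, as listed in the paper.
RELATIVE_TIME = {
    "hôm nay": 0,          # today
    "rạng sáng nay": 0,    # this morning
    "hôm qua": -1,         # yesterday
    "hai ngày trước": -2,  # two days ago
}

# Assumed pattern for the direct case: an explicit dd/mm/yyyy date.
ABSOLUTE_DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def extract_date(text, publish_date):
    """Return the event date from an explicit date or a relative expression."""
    m = ABSOLUTE_DATE.search(text)
    if m:
        day, month, year = map(int, m.groups())
        try:
            return date(year, month, day)
        except ValueError:
            pass  # not a real calendar date; fall through to the relative case
    lowered = text.lower()
    # Try longer expressions first so a short one cannot shadow a longer one.
    for phrase in sorted(RELATIVE_TIME, key=len, reverse=True):
        if phrase in lowered:
            return publish_date + timedelta(days=RELATIVE_TIME[phrase])
    return None


published = date(2012, 4, 13)
today = extract_date(
    "Hôm nay, hai vụ tai nạn giao thông đã xảy ra trên đường Khuất Duy Tiến.",
    published,
)
```

For the example sentence, the relative word "Hôm nay" has bias 0, so the event date equals the publish date.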

RELATIVETIME = (RELATIVE_WORD, BIAS)
= {(hôm nay (today), 0), (rạng sáng nay (this morning), 0), (hôm qua (yesterday), −1), (hai ngày trước (two days ago), −2)}

<DATE> = <PUBLISH_TIME> + <BIAS> (5)

In the next step, we extract the location where the event occurred. As stated in rule 6, we use two constituents to find the proper location. The first is LOCPREP, a set of prepositions that come before the place; the second is LOCPREFIX, a set of prefixes that come after these prepositions but before the place name. We then apply rule 6 to perform this task.

LOCPREP = {ở (in), tại (at), trên (on), gần (nearby, near), trong (in, inside)}

LOCPREFIX = {thành phố (city), tỉnh (province), quận (district), thị xã (town), xã (village), phố (street)}

LOCATION = {loc_i | loc_i ∈ location dictionary}

<LOCPREP> <LOCPREFIX> <LOCATION> (6)

Finally, participants are considered. As with the previous event information, we use a rule, shown in (7).

<PRE> <PERSON> (7)

PRE:= ông (Mr)

(Mrs/Ms)

gia đình (family)

nghi can (suspect)

bị cáo (defendant)
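Rules 6 and 7 can be sketched with regular expressions built from the sets above. Several points are assumptions and not taken from the paper: the tiny `LOCATIONS` list stands in for the location dictionary, a person name is approximated as the run of capitalized words following a PRE word (the paper does not say how PERSON is recognized), and all function names are hypothetical. Only the PRE entries whose Vietnamese form appears in the list above are used.

```python
import re

LOCPREP = ["ở", "tại", "trên", "gần", "trong"]
LOCPREFIX = ["thành phố", "tỉnh", "quận", "thị xã", "xã", "phố"]
# Hypothetical stand-in for the location dictionary.
LOCATIONS = ["Hà Nội", "Hưng Yên", "Cầu Giấy"]
PERSON_PRE = ["ông", "gia đình", "nghi can", "bị cáo"]


def _alt(words):
    """Build a regex alternation, longest first so 'thị xã' wins over 'xã'."""
    return "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))


# Rule 6: <LOCPREP> <LOCPREFIX> <LOCATION>
LOCATION_RULE = re.compile(
    rf"(?<!\w)(?:{_alt(LOCPREP)})\s+(?:{_alt(LOCPREFIX)})\s+(?P<loc>{_alt(LOCATIONS)})(?!\w)",
    re.IGNORECASE,
)
# Rule 7: <PRE> <PERSON>; this pattern only finds the PRE word.
PERSON_RULE = re.compile(rf"(?<!\w)(?:{_alt(PERSON_PRE)})\s+", re.IGNORECASE)


def extract_location(text):
    """Return the dictionary location matched by rule 6, or None."""
    m = LOCATION_RULE.search(text)
    return m.group("loc") if m else None


def extract_participants(text):
    """Return the capitalized word runs that follow a PRE word."""
    people = []
    for m in PERSON_RULE.finditer(text):
        name = []
        for token in text[m.end():].split():
            word = token.rstrip(".,;:!?\"')")
            if not word or not word[0].isupper():
                break
            name.append(word)
            if word != token:  # trailing punctuation ends the name
                break
        if name:
            people.append(" ".join(name))
    return people


sentence = "Công an đã bắt nghi can Nguyễn Văn A tại quận Cầu Giấy."
location = extract_location(sentence)
participants = extract_participants(sentence)
```

The `(?<!\w)` guard keeps a short PRE word such as "ông" from matching inside a longer word such as "Công". A location that appears without a LOCPREFIX is not matched, which follows the rule as written.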

4 Experiment and Result

Our experiment was conducted on a data set of 18,400 titles extracted from 3,842,137 news titles of BAOMOI⁴ gathered through RSS feeds. The components of a news item are illustrated in table 2. In addition, we evaluated event detection by evaluating the event classification process with 10-fold cross validation. The data set is split into 10 parts; in each run, 9 parts are used as training data and 1 part as testing data (a 9:1 ratio). The classification results are shown in table 3 and figure 2.

Table 2: News elements

Title          The news headline
Abstract       A short paragraph that summarizes the news content
Publish time   The time when the news is published; may support the time extraction process
Link           Link to the original news

Table 3: Result of classification

Fold       Precision   Recall   F1
Fold 1     92.70       89.23    90.93
Fold 2     93.08       91.39    92.23
Fold 3     93.32       91.54    92.42
Fold 4     93.32       91.54    92.42
Fold 5     93.68       91.78    92.72
Fold 6     93.50       91.60    92.54
Fold 7     92.95       90.81    91.87
Fold 8     92.39       89.01    90.67
Fold 9     91.81       88.65    90.20
Fold 10    91.68       88.51    90.07
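For reference, the evaluation protocol can be sketched as follows. This is a minimal illustration with hypothetical function names. The fold assignment (every tenth item) is an assumption, since the paper does not say how titles were distributed across folds, and F1 is taken as the harmonic mean of precision and recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(gold, predicted, positive="EVENT"):
    """Compute precision, recall and F1 for the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists; fold i tests every k-th item starting at i."""
    for fold in range(k):
        test = list(range(fold, n_items, k))
        held_out = set(test)
        train = [i for i in range(n_items) if i not in held_out]
        yield train, test


folds = list(k_fold_indices(18400, k=10))
fold1_f1 = round(f1_score(92.70, 89.23), 2)
```

Applied to the precision and recall reported for Fold 1 (92.70 and 89.23), this F1 formula gives 90.93, which agrees with Table 3. With 18,400 titles, each of the 10 folds tests on 1,840 titles and trains on the remaining 16,560.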

After the system went online at http://vnloc.com (figure 3), we evaluated the event extraction results manually by checking each event shown on the system from 13/04/2012 to 22/04/2012. The precision statistics of event extraction are shown in table 4. Based on the articles detected as containing events, the statistics in table 4 indicate that an event extraction strategy using lexico–semantic rules and machine learning is appropriate for Vietnamese news. In some cases, event extraction fails because of ambiguous places: many locations have similar names, while the article does not state the position fully.

⁴ http://www.baomoi.com/

Figure 3: VnLoc at http://vnloc.com

Figure 2: Result of classification

5 Conclusion

In this paper, we have presented a method that combines lexico–semantic rules and machine learning (Maximum Entropy) for event extraction on Vietnamese data, and described the VnLoc system.

The experimental results demonstrate that combining lexico–semantic rules and machine learning achieves good results on Vietnamese data. The Maximum Entropy machine learning method is used for binary classification, keeping only the events that fit the features in the training data set. Lexico–semantic rules are applied to extract the useful information of an event. Eventually, the event's attributes are extracted based on rules and visualized on an online map.

Table 4: Event extraction result (in quantity)

Date   Extracted   Correct   Precision (%)

Furthermore, we have described the system architecture in detail. In particular, we concentrated on describing the activity of the Event Detector component, which uses the proposed method to recognize an event, and the Event Extractor, which uses lexico–semantic rules to extract the event's attributes. Although we have achieved good results, the system needs some improvements to enhance its quality in the future. Firstly, the precision of the Maximum Entropy classifier must be enhanced by adding useful information. Secondly, we aim to expand to other areas such as disasters (disease, earthquake, tsunami), culture and finance. To this end, an ontology is being built to make it easier to integrate plugin modules for this purpose.

This research work was partly supported by The National Major Research Program KC.01/11-15 (code KC.01.TN04/11-15) under project "Analyzing opinion’s trend based on social network and its application in tourism and technology products".

References

[1] J. Allan, R. Papka, and V. Lavrenko. On-line new event detection and tracking. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’98, pages 37–45, New York, NY, USA, 1998. ACM.

[2] J. Borsje, F. Hogenboom, and F. Frasincar. Semi–automatic financial events discovery based on lexico–semantic patterns. Int. J. Web Eng. Technol., 6(2):115–140, Jan. 2010.

[3] K. B. Cohen, K. Verspoor, H. L. Johnson, C. Roeder, P. V. Ogren, W. A. Baumgartner, Jr., E. White, H. Tipney, and L. Hunter. High-precision biological event extraction with a concept recognizer. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task, BioNLP ’09, pages 50–58, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.

[4] F. Hogenboom, F. Frasincar, U. Kaymak, and F. de Jong. An overview of event extraction from text. Workshop on Detection, Representation, and Exploitation of Events in the Semantic Web, 2011.

[5] R. Grishman and B. Sundheim. Message understanding conference-6: a brief history. In Proceedings of the 16th conference on Computational linguistics - Volume 1, COLING ’96, pages 466–471, Stroudsburg, PA, USA, 1996. Association for Computational Linguistics.

[6] H. Ji and R. Grishman. Refining event extraction through cross-document inference. In Proc., 2008.

[7] R. Grishman, D. Westbrook, and A. Meyers. NYU’s English ACE 2005 system description. ACE Program, 2005.
