DSpace at VNU: Extraction of Disease Events for a Real-time Monitoring System


Extraction of Disease Events for a Real-time Monitoring System

Minh-Tien Nguyen
Hung Yen University of Technology and Education (UTEHY)
Knowledge Technology Laboratory (KT-Lab)
tiennm@utehy.edu.vn

Tri-Thanh Nguyen
Vietnam National University, Hanoi (VNUH), University of Engineering and Technology (UET)
Knowledge Technology Laboratory (KT-Lab)

ntthanh@vnu.edu.vn

ABSTRACT

In this paper, we propose a method that uses both semantic rules and machine learning to extract infectious disease events from Vietnamese electronic news, which can be used in a real-time system for monitoring the spread of diseases. Our method contains two important steps: detecting disease events in unstructured data and extracting the information of those events. The event detection step uses semantic rules and machine learning to detect a disease event; in the latter step, Named Entity Recognition (NER), rules, and dictionaries are used to capture the event’s information. The performance of the detection step is ≈77.33% (F-score) and the precision of the extraction step is ≈91.89%. These results are better than those of the experiments in which rules were not used. This indicates that our method is suitable for extracting disease events from Vietnamese text.

Categories and Subject Descriptors

H.2.8 [Database Applications]: Data Mining

General Terms

Data Mining; Information Extraction

Keywords

Data Mining; Information Extraction; Event Extraction; Disease Event Extraction; Monitoring Systems

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
SoICT’13, December 05-06, 2013, Danang, Viet Nam
http://dx.doi.org/10.1145/2542050.2542084.

Information from electronic newspapers provides valuable input for public health surveillance, early outbreak detection, and disease monitoring systems. When the presence of a disease is announced by the government and published on a webpage, it is typically called a disease event or an infectious disease outbreak. Unfortunately, electronic resources on infectious diseases are multidimensional, chaotic, and not well organized, so extracting useful patterns from these sources is really challenging. "How to detect an infectious disease event?" and "How to extract information about an infectious disease event?" are the two important questions on which this paper focuses.

Detecting diseases and monitoring their spread and outbreaks are extremely meaningful issues for society, especially when the diseases are dangerous and highly infectious. Because an infectious disease normally breaks out in a short time and spreads very quickly over a large area, it can create emergency circumstances not only for citizens but also for the government and the economy. Therefore, monitoring infectious disease outbreaks is crucial for preventing and handling diseases and for helping the authorities make suitable decisions.

In this paper, we propose a model that automatically detects and extracts information about human infectious disease events from Vietnamese webpages based on semantic rules and machine learning. The model includes two important components: disease event detection and disease event extraction. In the first component, an infectious disease event is detected in free text; after that, the information of the event (time, disease name, and locations) is extracted in the second component. Subsequently, we combine the extracted information to form an infectious disease event. These infectious disease events can be the input to our monitoring system for visualization.

Our paper is organized as follows: related work is in Section 2; our method is discussed in Section 3, in which event detection is described in Section 3.3 and event extraction in Section 3.4. Section 4 gives the experiments and results and explains the source of some errors appearing in our research. The last section is the conclusion.

Event extraction was first introduced as an important topic in 1987 at the Message Understanding Conference (MUC) [11]. In MUC, an event is defined as follows: "an event must have actor, time, place and impact on the surrounding environment". Later, in the Automatic Content Extraction (ACE) program, Doddington George R., et al. gave an event definition, "an event is an activity that was created by participants", and divided events into eight types: Life, Movement, Transaction, Business, Conflict, Contact, Personnel, and Justice [7]. Moreover, as Allan J., et al. stated, an event includes four attributes: modality, polarity (Positive, Negative), genericity (Specific, Generic), and tense (Past, Present, Future, Unspecified) [10]. Grishman R., et al. defined a disease event as a template: Disease Name, Date, Location, Victim Number, Victim Descriptor, Victim Status, Victim Type, Parent Event [9].

Hogenboom F., et al. provided a general guideline on how to select a suitable method for event extraction [2]. The guideline groups event extraction approaches into data-driven, knowledge-driven, and hybrid. Each approach has both advantages and disadvantages. Hogenboom F., et al. compared the benefits and drawbacks of these methods and concluded that the hybrid approach prevails.

Event extraction from unstructured text can be applied in many fields, especially in the disease domain. Grishman R., et al. used 120 linguistic event patterns to analyze sentences and capture the information of a disease event [9]. These linguistic patterns were built on word classes and the relations among them. For example, the pattern "np (DISEASE) vg (KILL) np (VICTIM)" will match a clause like "Cholera killed 23 inhabitants". An event is recognized based on the trigger of two noun phrases: "outbreak of " and "people died from ". These patterns were applied to extract disease events and achieved an F-score of ≈53.98%. Normally, applying linguistic patterns can achieve high results if the patterns cover the whole dataset, but preparing them is always time-consuming and requires domain experts. Moreover, the patterns must be changed when the data fluctuate. Finally, because the patterns were built on word classes, the authors must identify word classes (e.g., noun phrase, verb phrase), which is more challenging in some other languages (e.g., Vietnamese or Chinese). Because of these drawbacks, we do not follow this approach.

Volkova S., et al. combined entity recognition and sentence classification to extract animal disease events [4]. The event recognition consists of three main steps: first, entities are recognized in unstructured text; second, sentences are classified based on these entities; finally, the entities within an event sentence are combined into a structured tuple. In the event recognition, a true event should contain a disease name and a disease-related verb. The authors obtained a precision of 75% in event tuple recognition and 65% in sentence classification, with features from WordNet and the Google-Set corpus. However, using a list of verbs to confirm an event can badly affect event extraction in Vietnamese, because of the lack of resources for Natural Language Processing (NLP) (such as a Vietnamese WordNet or a Google-Set-like corpus for Vietnamese) and because the performance of parsing utilities is not high enough. Thus, we do not use this method.

Doan S., et al. built the Global Health Monitor system, which shows the state of disease spread around the world [5]. The system includes three main steps: topic classification, Named Entity Recognition (NER), and disease/location detection. A Naïve Bayes classifier is used in topic classification with a precision of ≈88.10%; the entity recognition step, which uses a Support Vector Machine (SVM), achieved an F-score of ≈76.97%; and the final step achieved a precision of ≈93.40% with the BioCaster Ontology. However, this system has some limitations. The first is location ambiguity: because some locations are not mentioned clearly in the input data (only provinces or cities are given, without a country name), the system cannot recognize the location exactly. Furthermore, the BioCaster system cannot detect new diseases or locations that are not in the ontology.

Our approach uses the advantages of both the semantic rule-based method and machine learning in two main components: event detection and event extraction. In event detection, the semantic rules play the role of a data filter, while the classification model decides whether a news article contains an event. Because our rules are used only as a filter, they are simpler than those of Grishman R., et al. [9]. A rule in our study is a short phrase composed of a noun phrase and a verb phrase rather than a complete sentence. Moreover, we do not use a list of verbs to confirm events as Volkova S., et al. did [4], because this method typically depends on the coverage of the verbs, and building such a list always takes much time. In event extraction, our approach is similar to the method of Doan S., et al. [5]. We use rules, a disease dictionary, an NER tool, and a location dictionary to extract the information of a disease event.

In addition, there are several systems which extract events from online news. Grishman R., et al. built the Proteus-BIO system, in which users can follow infectious diseases [8]. Data in this system are collected from webpages and from disease reports of the World Health Organization¹ and ProMed². Collier N., et al. built the BioCaster system³, which follows several event types, especially disease events, around the world. Similarly, HealthMap⁴ was built by Freifeld Clark C., et al.; with it, users can monitor diseases all over the world [6].

DETECTION AND EXTRACTION

3.1 Infectious Disease Event Characteristics

An investigation of our data domain indicates that an infectious disease event may contain a disease name, a time, locations, and victims. In some cases, it may have additional information such as the method or the environment of infection. Although Grishman R., et al. [9] used the disease name, the time and the location of the outbreak, the number of affected victims, and the type of victims as the information of a disease event, we focus on only three basic pieces of information: the time and the locations of the outbreak and the name of the infectious disease. We ignore the method and environment information because we collect data from webpages instead of medical reports, so such information is not clearly mentioned in most cases. Moreover, an event in MUC must include an actor [11]; in our study, the actor is equivalent to a disease, so we use the disease name instead of the actor.

In addition, a closer examination of disease news articles showed that a disease name is sometimes similar to a symptom, which is one of the causes of confusion in event extraction. For example, ‘pneumonia’ is a symptom of ‘bird flu’ (A/H5N1), but it was recognized as a disease in some cases.

¹ http://www.who.int/csr/don/en/
² http://www.promedmail.org/
³ http://born.nii.ac.jp
⁴ http://www.healthmap.org

3.2 Problem Definition

The infectious disease event extraction problem can be defined as follows:

Input: a news article.

Output: whether the news article contains an infectious disease event; if it does, the extracted information of the event.

In our research, an infectious disease event E is defined as a tuple that has three elements:

E = <name, time, place> (1)

where name is the name of the infectious disease mentioned in the disease news article; time is the time when the disease breaks out; and place is the set of locations where the disease appears.

Figure 1: Steps of disease event extraction

Figure 2: Event detector components
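As an illustration only, the tuple of Formula (1) could be held in a small immutable record. The class name `DiseaseEvent`, the field types, and the sample values (taken from the article quoted in Example 3) are assumptions made for this sketch and are not part of the original system.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class DiseaseEvent:
    """Hypothetical container for one event E = <name, time, place>."""
    name: str    # disease name mentioned in the news article
    time: date   # time when the disease breaks out
    # place is a *set* of locations, as stated in the definition
    place: frozenset = field(default_factory=frozenset)


# Sample values borrowed from Example 3 of the paper.
event = DiseaseEvent(
    name="cúm A H5N1",
    time=date(2012, 3, 12),
    place=frozenset({"Quảng Ngãi"}),
)
print(event)
```

Using a frozen record with a `frozenset` keeps an event hashable, which would make de-duplication in an event store straightforward; this is a design choice of the sketch, not something the paper specifies.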

We propose a process to extract the information of a disease event, as illustrated in Figure 1. The extraction process includes five components: the crawler retrieves data from the Internet; the pre-processing component extracts the main content from the web pages returned by the crawler (the detail of this module is described in Section and Table 3); the event detector decides whether a news article contains a disease event; the event extractor captures the information of the event in a given news article (if any); finally, the visualization component plots the disease events on an online Geographic Information System (GIS) map.

In this paper, we focus on the two key components, the event detector and the event extractor, which are described in detail in Section 3.3 and Section 3.4.

3.3 Event Detector

The goal of the Event Detector is to judge whether a given news article reports a disease event. When a news article is given, the detector determines whether it contains a disease event (EVENT) or not (NOT_EVENT) by using rules (for title filtering) and machine learning (for classification). The process of the event detector is illustrated in Figure 2.

The event detector consists of two modules: a data filter and a classifier. The filter module receives data from the pre-processing component, where HTML tags are removed to obtain the main content. This module then filters disease news articles by checking their titles. Subsequently, the data is passed to the classifier, which decides whether a news article contains an event.

3.3.1 Filtering Rules

As mentioned above, the event detection component has two modules, a data filter and a classifier, in which the filter uses semantic rules to reduce the number of news articles passed on for classification. We examined the domain data carefully and found that most news titles express the main content of the article. This means that the title of a news article carries enough evidence to signal the existence of a disease event. Therefore, we use rules to filter disease-related news articles.

Table 1: List of frequent-words

We computed statistics on a large dataset of news articles from the "Sức khỏe" (Health)⁵ category of the "Báo mới" news website⁶ to find a set of frequent words (and phrases). There are 34 frequent words; some of the most frequent are given in Table 1, where the third column counts the number of articles containing the corresponding words in the second column. We denote this set as the Frequent-words set.

We observed that most news articles relating to a disease event contain words in the Frequent-words set. Therefore, our idea is to build semantic rules by combining words in the Frequent-words set in order to filter the input data. As a result, we propose two patterns, named Pattern 1 and Pattern 2, which represent all our semantic rules. These patterns are shown below:

Pattern 1 = noun phrase # verb phrase (2)

where noun phrase and verb phrase are in the Frequent-words set. Pattern 1 is illustrated in Example 1.

Example 1:
bệnh nhân tử vong # nhiễm (died patient # infected)
dịch tả # bùng phát (cholera # broke out)

Pattern 2 = disease name # verb phrases (3)

where:

• disease name is retrieved from the BioCaster Ontology [3] and the circular of the Ministry of Health of Vietnam⁷, dated June 24th, 2011;

• verb phrases are in the Frequent-words set.

Rules following Pattern 2 are given in Example 2.

Example 2:
tiêu chảy cấp # nhiễm (acute diarrhea # infected)
tiêu chảy cấp # phát hiện (acute diarrhea # discovered)
tiêu chảy cấp # lây lan (acute diarrhea # spread)
tiêu chảy cấp # bùng phát (acute diarrhea # broke out)
tiêu chảy cấp # chết (tử vong) (acute diarrhea # died)
tiêu chảy cấp # dương tính (acute diarrhea # is positive)

Both patterns have two elements, which are separated by the character "#". We built 43 rules from Pattern 1 by mixing 52 noun phrases and 10 verb phrases. Both these noun phrases and verb phrases are in the Frequent-words set. Similarly, we used a disease name and a verb phrase to create a rule following Pattern 2. With 186 disease names from the disease dictionary and 6 verb phrases in the Frequent-words set, the number of rules conforming to Pattern 2 is 186. Some verb phrases in Pattern 1 and Pattern 2 are the same.

⁵ http://www.baomoi.com/Home/SucKhoe.epi
⁶ http://www.baomoi.com
⁷ http://www.moh.gov.vn/

Table 2: List of features

No   Feature
1    Dịch tay chân miệng (hand, foot, and mouth disease epidemic)
2    Tiêu chảy (diarrhea)
3    Trẻ tử vong (the child died)
4    Ổ dịch (disease source)
5    Dương tính (is positive)
6    Dịch cúm gia cầm (bird flu)
7    Ca tử vong (deaths)
8    Bùng phát dịch (outbreak)
9    Dịch cúm (flu)
10   Bệnh nhân tử vong (the patient died)

After building the rule set, we had 229 rules in total. The related articles are retrieved by these rules and passed to the classifier.
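The title filter can be sketched roughly as follows. The paper does not state how a rule is matched against a title, so this sketch assumes that a title matches when both sides of the "#" occur somewhere in it, in any order and ignoring case. It also assumes that every left-hand phrase is paired with every verb phrase, which is not how the reported rule counts were obtained (43 rules were built from 52 noun phrases and 10 verb phrases). The function names and the two sample titles are hypothetical.

```python
from itertools import product


def build_rules(left_phrases, verb_phrases):
    """Pair each left-hand phrase with each verb phrase as 'left # verb'.

    Assumption: a full cross product; the paper's rule set is a
    hand-selected subset of such pairs.
    """
    return [f"{left} # {verb}" for left, verb in product(left_phrases, verb_phrases)]


def title_matches(title, rule):
    """True when both parts of the rule occur in the title (case-insensitive)."""
    left, verb = (part.strip().lower() for part in rule.split("#"))
    lowered = title.lower()
    return left in lowered and verb in lowered


def filter_titles(titles, rules):
    """Keep only the titles that match at least one rule."""
    return [t for t in titles if any(title_matches(t, r) for r in rules)]


rules = build_rules(["dịch tả", "tiêu chảy cấp"], ["bùng phát", "lây lan"])
titles = [
    "Dịch tả bùng phát tại miền Trung",  # hypothetical disease-related title
    "Giá vàng tăng mạnh",                # hypothetical unrelated title
]
kept = filter_titles(titles, rules)
print(kept)
```

Because the filter only checks for co-occurrence, it favours recall over precision, which is consistent with the role the paper gives it as a pre-filter ahead of the classifier.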

3.3.2 Machine Learning Application

The classification model assigns a news article either the EVENT or the NOT_EVENT label. Our investigation of the input data suggests that the title and abstract of a disease news article carry enough information to represent its content; therefore, these elements are used to create the feature vector. In the data preparation step, articles are manually tagged with the label EVENT or the label NOT_EVENT. After that, features are generated by using 2-grams, 3-grams, and 4-grams. As a result, we obtain 4,552 features, which are used for classification. Some features are shown in Table 2.

We used the Maximum Entropy Model⁸ as the classifier. The news articles labeled EVENT become the input for the Event Extractor component.
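The n-gram feature generation described above might look like the following sketch. It assumes word-level n-grams over whitespace-separated tokens; the paper does not say whether the n-grams are built over syllables or over segmented words, and the function name `word_ngrams` and the sample phrase are chosen only for illustration.

```python
def word_ngrams(tokens, n_values=(2, 3, 4)):
    """Return all contiguous n-grams of the given sizes, joined by spaces."""
    grams = []
    for n in n_values:
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


# A short phrase built from entries of Table 2 (assumed input, whitespace tokenized).
tokens = "dịch cúm gia cầm bùng phát".split()
features = word_ngrams(tokens)
print(features)
```

In a full pipeline, the n-grams collected from the titles and abstracts of all labeled articles would form the feature vocabulary for the Maximum Entropy classifier.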

3.4 Event Extractor

The Event Extractor is the second of the two key components; it extracts the information of a disease event. The event extraction component is illustrated in Figure 3.

Event extraction includes three modules: time extraction, disease extraction, and location extraction. The first module uses rules to extract the time information; the second module uses a disease dictionary to extract the disease information; and the final module combines NER and a location dictionary to capture the place information. Finally, we combine the extracted information to form a disease event and store it in an event database.

3.4.1 Time Extraction

Our investigation of the dataset suggests that time information can be captured by rules and that it is either absolute or relative. In the absolute case, the time has the format DD/MM/YYYY, so we use a Regular Expression (RE) to extract it. In the relative case, it always contains two elements: a prefix and the time. The prefix is a set of words that indicates relative time, and the time is usually in the Vietnamese date form DD/MM/YYYY. Therefore, we use a rule [1] to calculate the absolute time. The time rule is shown in Formula (4).

TIME = <RELATIVE TIME> + <DATE TIME> (4)

where:

• RELATIVE TIME = vào (on), ngày (date), sáng (morning), hôm nay (today), sáng hôm nay (this morning), chiều (afternoon), hôm qua (yesterday), tối qua (yesterday evening), rạng sáng (early morning), tháng (month)

• DATE TIME has the format DD/MM/YYYY and is either the date expressed in the article content or the published date.

⁸ http://www.cs.princeton.edu/maxent

Figure 3: Event extractor component

Example 3 and Example 4 illustrate the use of the Regular Expression and the time rule to extract the time information.

Example 3:

“Ngày 12/03/2012, Bộ Y tế công bố dịch cúm A H5N1 đã tái phát tại Quảng Ngãi.” (On March 12th, 2012, the Ministry of Health announced that the A H5N1 flu had hit Quang Ngai again.)

Example 4:

“Sáng ngày 15/01/2012, Sở Y tế Hà Nội thông báo bệnh nhân đầu tiên nhiễm cúm A/H5N1 đã tử vong” (In the morning of January 15th, 2012, the Hanoi Health Department announced that the first patient infected with A/H5N1 flu had died.)

The time information in Example 3 is captured by the Regular Expression, while in Example 4 it is extracted by Formula (4). As a result, the time information in Example 3 is March 12th, 2012, whereas in Example 4 it is In the morning of January 15th, 2012.
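A minimal sketch of the time extraction is given below, under several assumptions. It treats the words listed under RELATIVE TIME as optional prefixes in front of a DD/MM/YYYY date, and it falls back to the published date when no date occurs in the text. It does not resolve purely relative expressions such as "hôm qua" without a date, since the rule from [1] is not reproduced in the paper. The names `PREFIXES`, `TIME_RE`, and `extract_time` are hypothetical.

```python
import re
from datetime import date

# Prefix words from Formula (4), longest first so that "sáng hôm nay"
# is preferred over "sáng".
PREFIXES = ["sáng hôm nay", "rạng sáng", "hôm nay", "hôm qua", "tối qua",
            "sáng", "chiều", "tháng", "ngày", "vào"]
PREFIX_RE = "|".join(re.escape(p) for p in PREFIXES)

# Zero or more prefix words followed by a DD/MM/YYYY date.
TIME_RE = re.compile(
    rf"(?P<prefix>(?:(?:{PREFIX_RE})\s+)*)(?P<d>\d{{2}})/(?P<m>\d{{2}})/(?P<y>\d{{4}})",
    re.IGNORECASE,
)


def extract_time(text, published):
    """Return (prefix, date); use the published date when no date is found."""
    match = TIME_RE.search(text)
    if match is None:
        return "", published
    found = date(int(match["y"]), int(match["m"]), int(match["d"]))
    return match["prefix"].strip(), found


# Example 4 from the paper; the published date here is an assumed value.
print(extract_time("Sáng ngày 15/01/2012, Sở Y tế Hà Nội thông báo", date(2012, 1, 16)))
```

Keeping the matched prefix alongside the parsed date preserves the distinction between the two examples: Example 3 yields only the date marker "Ngày", while Example 4 also yields the part of day.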

3.4.2 Disease Extraction

Disease extraction is the second module, which captures the disease name. As shown in Figure 1, the pre-processing component tokenizes and word-segments the content of the articles. As a result, each article has a list of words. These words are the input for this module. The disease extraction module uses a disease dictionary of 186 disease names for the extraction.


The extraction process can be described in two steps: finding the longest phrase that can be a name candidate, and matching the candidate against the original article to check whether it is a correct name. The finding step uses the longest matching method to match a word (in an article) with a disease name (from the disease name dictionary). If a disease name contains a given word, then it is a disease name candidate. In the matching step, each candidate is checked to see whether it appears in the article; a correct candidate must appear in the original article. The disease extraction process is illustrated in Example 5.

Example 5:

“Dịch cúm A/H5N1 bùng phát tại Bến Tre” (A/H5N1 flu breaks out in Ben Tre)

After tokenizing and word-segmenting, we retrieve two words related to disease: cúm (flu) and A/H5N1. The finding step matches these words against the disease dictionary to find the longest name. As a result, for the word cúm (flu) we retrieve three names: cúm (flu), cúm A/H5N1 (A/H5N1 flu), and cúm gia cầm (bird flu), while for the word A/H5N1 we retrieve only one name: cúm A/H5N1 (A/H5N1 flu). In the next step, the matching process checks these names against the original article to find the correct result. In this example, the longest item is cúm gia cầm (bird flu), but it does not appear in Example 5, so this disease is ignored. The second longest name is cúm A/H5N1 (A/H5N1 flu), and the matching process finds that it is in the original article. It is therefore the correct disease name, and the value of the disease information is cúm A/H5N1 (A/H5N1 flu).
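The two steps of Example 5 can be sketched as below. This is a simplified reading of the description, assuming plain substring tests for both the finding and the matching step; the function name `extract_disease` and the three-entry dictionary are illustrative stand-ins for the 186-name dictionary used in the paper.

```python
def extract_disease(words, article_text, dictionary):
    """Return the longest dictionary name that also occurs in the article.

    Finding step: collect every dictionary name containing one of the words.
    Matching step: try candidates from longest to shortest and keep the
    first one that appears verbatim in the article text.
    """
    candidates = {name for name in dictionary for word in words if word in name}
    for name in sorted(candidates, key=len, reverse=True):
        if name in article_text:
            return name
    return None


dictionary = ["cúm", "cúm A/H5N1", "cúm gia cầm"]  # tiny stand-in dictionary
article = "Dịch cúm A/H5N1 bùng phát tại Bến Tre"
result = extract_disease(["cúm", "A/H5N1"], article, dictionary)
print(result)
```

As in the worked example, "cúm gia cầm" is the longest candidate but is rejected because it does not occur in the article, so the next longest candidate is returned.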

3.4.3 Location Extraction

Building the final module is more challenging than the two previous ones because of the ambiguity among locations. In fact, several places can have the same proper name (e.g., "Dong Hai" town is a location in both "Tra Vinh" and "Quang Ninh" provinces). Therefore, if a news article does not mention locations clearly, the place information can be confused. To deal with this issue, we combined NER and a location dictionary to improve the performance of location extraction.

The location extraction process can be described in three steps: NER, location extraction, and normalization. First, the NER tool⁹ is applied to detect location entities in a given news article. As a result, locations in the article are labeled by a pair of <LOC> and </LOC> tags. Second, we extract the locations based on these tags. In the final step, each location is normalized by looking it up in the location dictionary, which is described in detail below.

We used a location dictionary that is organized as a taxonomy, shown in Figure 4, where:

• T is the abbreviation of town

• C is the abbreviation of commune

In this taxonomy, the highest level is the root node; level 1 represents 63 provinces; level 2 contains 692 districts; and level 3 contains 11,101 towns and communes. If a phrase inside the <LOC> and </LOC> tags matches the value of a node, that node is marked, and the complete location is the path from that node to the root node. This organization makes it efficient to identify the relations among communes, towns, and provinces and helps to avoid geo-ambiguity. The efficiency of the taxonomy is shown in Example 6.

⁹ http://jvntextpro.sourceforge.net

Figure 4: The location dictionary taxonomy

Example 6:

“Ngày 12/04/2013, Sở y tế Quảng Ngãi thông báo dịch cúm A H5N1 đã bùng phát tại thị trấn Sông Vệ” (On April 12th, 2013, the Department of Health of Quang Ngai announced an A H5N1 flu outbreak in Song Ve town).

This example mentions only the town where the A H5N1 flu breaks out ("Song Ve" town), while the district and the province are absent. In the process of location extraction, this sentence is parsed by the NER tool, and "Song Ve" is labeled with the <LOC> and </LOC> tags, while "Quang Ngai" is recognized as an organization entity (<ORG>).

Ordinarily, after retrieving the location (inside the <LOC> and </LOC> tags), "Song Ve" would be the location information. However, "Song Ve" alone is incomplete and does not carry enough information to be placed on a GIS map. To solve this problem, we looked up "Song Ve" in the location taxonomy. When the node with this value is found, we traverse from this node to the root node of the taxonomy to extract the complete information (i.e., "Song Ve" town, "Tu Nghia" district, and "Quang Ngai" province). This step is called location normalization.

Finally, the extracted time, disease name, and locations from the article are combined to create an infectious disease event, in which the set of locations found in this module forms the place component of the event. The event is stored in an event database, which is used by the visualization component in a real-time monitoring system.

4.1 Data Preparation

Our data is retrieved from the "Báo mới" news website¹⁰ because "Báo mới" automatically crawls a large number of news articles per day from most of the well-known Vietnamese websites; hence, it is a good data source. After crawling, we had a dataset (denoted as the raw dataset) of 3,842,137 news articles. The elements of a news article (after the pre-processing step) are shown in Table 3.

After crawling the data, we used Pattern 1 (2) and Pattern 2 (3) as filters and obtained a set of 1,668 disease-related news articles. We denote this set of 1,668 articles as the Filtered dataset for later reference.

In our study, experiments are conducted on two important components, the Event Detector and the Event Extractor, which are described in detail in Section 4.2 and Section 4.3.

¹⁰ http://www.baomoi.com/

Table 3: News article’s elements

Element          Description
Title            The title of the article
Abstract         A short paragraph that summarizes the article’s content
Published time   The time when the news is published; it supports the time extraction process
Content          The content of the article

Table 4: The error rate of the data filter module

Incorrect articles   Total   Error Rate (%)

4.2 Event Detection Evaluation

As mentioned above, the Event Detector has two modules, the data filter and the classifier. Therefore, we evaluate the performance of this component based on these two modules.

4.2.1 Data Filter Evaluation

The data filter is the first module in event detection; it filters the articles coming from the data crawler component. As mentioned above, this module uses Pattern 1 (2) and Pattern 2 (3) to filter articles, so its performance depends on the coverage of the rules of the two patterns. Normally, we would have to evaluate the precision of Pattern 1 and Pattern 2 on the whole dataset (about 3,842,137 news articles), but this approach is very costly because the articles have to be labeled manually.

To evaluate the performance of this module, we randomly selected 486 articles from the raw dataset and manually checked the error rate. The error rate was calculated using Formula (5), and the results are shown in Table 4. The results show that the error rate is high (that is, the accuracy is low) because the filter passed all the articles related to diseases, a large number of which did not present disease events (this issue is discussed in detail in Section 4.4). We accept this in order to gain high recall; the overall performance is improved by the subsequent phases.

$$\mathrm{ErrorRate} = \frac{\#\mathrm{incorrect}}{\mathrm{total}} \quad (5)$$

where:

• #incorrect is the number of articles which are not related to disease

• total is the total number of articles.

4.2.2 Classification Evaluation

We carried out two experiments to evaluate the performance of the classification, namely Experiment a, which combines rules and machine learning, and Experiment b, which uses only machine learning. The measures used to evaluate the performance of this module are precision, recall, and F-score, based on 10-fold cross validation.

In Experiment a, we randomly selected 686 articles from the Filtered dataset and tagged them as EVENT or NOT_EVENT. We denote this set as the Experiment a dataset. In Experiment b, we selected 50 more articles from the raw dataset and added them to the Experiment a dataset to form the Experiment b dataset.

Table 5: The comparison of Experiment a and Experiment b

        Experiment a              Experiment b
        P       R       F         P       R       F
Avg     75.07   79.76   77.33     72.35   77.84   74.97

After preparing the training datasets, we compare the performance of the two experiments. The comparison of the two classifiers is shown in Table 5, where the results of Experiment a are in the three columns on the left and the results of Experiment b are in the three columns on the right. The average F-scores indicate that the classifier in Experiment a is better than that in Experiment b by ≈2.36%. The difference between the two classifiers is not large because we added only 50 articles to the Experiment a dataset. The difference would likely be larger if more raw articles were added to the Experiment b dataset.

4.3 Event Extraction Evaluation

Because an infectious disease event E is defined as a tuple that includes name, time, and place, as given in Formula (1), a correct event should contain all three elements. When the time of an event is not clearly mentioned in the text, we use the published date of the article as the time of the event. In other cases, if a disease event lacks either a disease name or locations, it is considered to be a false event.

To evaluate the event extraction step, we carried out two experiments, namely Experiment c (abbreviated as Expr c), which uses rules, and Experiment d (abbreviated as Expr d), which uses both rules and NER. The dataset used in both experiments consists of 152 news articles selected from the set of articles returned by the event detector.

We use three measures, Precision (P), Recall (R), and F-score (F), to compare the performance of the two experiments. These measures are given by Formulas (6), (7), and (8) as follows:

$$P = \frac{\#\,\text{correct}}{\#\,\text{correct} + \#\,\text{incorrect}} \tag{6}$$

where:

• # correct is the number of correct disease events

• # incorrect is the number of incorrect disease events

$$R = \frac{\#\,\text{correct}}{\#\,\text{correct} + \#\,\text{not\_found}} \tag{7}$$

where:

• # correct is the number of correct disease events

• # not_found is the number of disease events which the model did not recognize

$$F = \frac{2 \times P \times R}{P + R} \tag{8}$$

Table 6: The comparison of Experiment c and Experiment d

Name     Correct  Incorrect  P      R      F

Expr c   127      25         83,55  92,02  87,58

Expr d   136      16         89,47  94,44  91,89
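As a check on Formulas (6), (7), and (8), the values in Table 6 can be recomputed from the raw counts. The sketch below rests on two assumptions that are not stated in the paper: the # not_found counts (11 for Expr c and 8 for Expr d) are inferred here from the reported recall values, and the percentages are truncated rather than rounded to two decimals, which is the convention that reproduces the table. The function names are hypothetical.

```python
import math
from typing import Tuple


def truncate_percent(x: float) -> float:
    """Express a ratio as a percentage truncated (not rounded) to two decimals."""
    return math.floor(x * 10000) / 100


def evaluate(correct: int, incorrect: int, not_found: int) -> Tuple[float, float, float]:
    """Return (P, R, F) as percentages for one experiment."""
    p = correct / (correct + incorrect)   # Formula (6)
    r = correct / (correct + not_found)   # Formula (7)
    f = 2 * p * r / (p + r)               # Formula (8)
    return truncate_percent(p), truncate_percent(r), truncate_percent(f)


# Correct/incorrect counts come from Table 6; not_found is an inferred assumption.
print(evaluate(127, 25, 11))  # Expr c -> (83.55, 92.02, 87.58)
print(evaluate(136, 16, 8))   # Expr d -> (89.47, 94.44, 91.89)
```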

Based on Formulas (6), (7), and (8), we compare the performance of Experiment c and Experiment d. The comparison is shown in Table 6, where the second row is the result of Experiment c and the third row is the result of Experiment d. In Experiment c, the F-score is ≈87,58%, while it is ≈91,89% in the latter experiment. The result shows that the F-score of Experiment d improves by ≈4,31% in comparison with that of Experiment c. The cause of the difference between the two experiments is explained in the next section.

4.4 Error Analysis and Discussion

In the Event Detector component, the results in Table 4 suggest that there is confusion in the data filter module. To find the cause of the confusion, we manually checked articles selected from the dataset used in Section 4.2.1. The analysis indicates that, in the error cases, some rules of Pattern 1 (2) and Pattern 2 (3) are not effective at filtering articles. The reason is that several topics can share a verb. For instance, the verb phrase "tử vong" (die) may belong to either disease or treatment topics. If this verb appears in an article, the data filter module considers the article to be related to a disease event, even when it is in fact about a treatment topic, as illustrated in Example 7.

Example 7:

Uống thuốc hạ sốt sau 30 phút bệnh nhân tử vong (The patient died 30 minutes after taking fever medication).

This sentence is captured by the Pattern 1 (2) rule "bệnh nhân # tử vong" (patient # died), but in fact the cause of death is related to the medication rather than to a disease.

Moreover, some rules of Pattern 2 (3) (which is a combination of a disease name and a verb phrase) confuse a disease event with a topic that is merely related to a disease, as shown in Example 8.

Example 8:

"Phát hiện chủng virus mới gây bệnh tay chân miệng" (A

new strain of virus causing the hand, foot, and mouth disease

has been discovered)

The Pattern 2 (3) rule "tay chân miệng # phát hiện" (hand, foot, and mouth # detect) captures this sentence, but the sentence reports the discovery of a new virus strain rather than a disease event.
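The two false positives in Examples 7 and 8 can be reproduced with a small sketch. It assumes that the `#` separator in a rule means plain co-occurrence of the phrases within one sentence, in any order; this reading is inferred from Example 8, where the verb phrase precedes the disease name, and it is not stated in this section. `normalize` and `rule_matches` are hypothetical helpers, not part of the described system.

```python
import unicodedata


def normalize(text: str) -> str:
    """Lower-case and NFC-normalize text so accented strings compare reliably."""
    return unicodedata.normalize("NFC", text).lower()


def rule_matches(rule: str, sentence: str) -> bool:
    """Return True if every '#'-separated phrase of the rule occurs in the sentence."""
    phrases = [normalize(part.strip()) for part in rule.split("#")]
    target = normalize(sentence)
    return all(phrase in target for phrase in phrases)


example_7 = "Uống thuốc hạ sốt sau 30 phút bệnh nhân tử vong"
example_8 = "Phát hiện chủng virus mới gây bệnh tay chân miệng"

# Both sentences are captured although neither reports a disease event.
print(rule_matches("bệnh nhân # tử vong", example_7))         # True
print(rule_matches("tay chân miệng # phát hiện", example_8))  # True
```

Under this reading a rule looks only at surface co-occurrence, which is why it cannot separate a disease event from a treatment or research topic that uses the same words.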

For the Event Extractor component, the results in Table 6 indicate that the precision of Experiment d is ≈5,92% higher than that of Experiment c. At first, we were surprised by this result, because Experiment c uses a rule-based method to capture the information of an event, and rules (a knowledge-driven method) normally achieve high accuracy.

To find the source of the errors in event extraction, we manually checked the incorrect articles in the two experiments (mentioned in Section 4.3). The results are shown in Table 7 and Table 8, respectively. The statistics in Table 7 and Table 8 indicate that the errors in both experiments originated from location extraction and, in some cases, from disease extraction. In Experiment c, we found that the rules used to extract locations did not cover all cases. In a few cases, when the location information is abbreviated, the rules cannot recognize it, as illustrated in Example 9.

Example 9:

“Phát hiện một trường hợp bệnh nhân nhiễm cúm A H5N1 tại P.7, Q.8, TP HCM.” (A patient infected with A H5N1 flu was discovered in Ward 7, District 8, Ho Chi Minh City)

In this example, Ward 7, District 8, and Ho Chi Minh City are abbreviated as "P.7", "Q.8", and "TP HCM"; therefore, the rules cannot recognize the location information.
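One possible way to handle such cases, which is not part of the described system, is to expand common abbreviations before the location rules are applied. The sketch below is only an illustration: the `ABBREVIATIONS` table and the `expand_abbreviations` helper are hypothetical, and the table covers only the three forms that appear in Example 9.

```python
import re

# Hypothetical abbreviation table; only the forms seen in Example 9 are covered.
ABBREVIATIONS = [
    (re.compile(r"\bP\.\s*(\d+)"), r"Phường \1"),               # ward
    (re.compile(r"\bQ\.\s*(\d+)"), r"Quận \1"),                 # district
    (re.compile(r"\bTP\.?\s*HCM\b"), "Thành phố Hồ Chí Minh"),  # Ho Chi Minh City
]


def expand_abbreviations(text: str) -> str:
    """Replace each known abbreviation with its full administrative name."""
    for pattern, replacement in ABBREVIATIONS:
        text = pattern.sub(replacement, text)
    return text


print(expand_abbreviations("nhiễm cúm A H5N1 tại P.7, Q.8, TP HCM."))
# nhiễm cúm A H5N1 tại Phường 7, Quận 8, Thành phố Hồ Chí Minh.
```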

In Experiment d, the main factor that reduced the precision of location extraction is the performance of the NER tool. In a few cases, it did not detect locations exactly because places were abbreviated in the articles (as with the rule-based method). In some other cases, it misrecognized a location as an organization, as shown in Example 10.

Example 10:

“Ngày 12/03/2012, dịch tiêu chảy cấp đã bùng phát tại Hà Nội, Hải Phòng, Quảng Ninh, Bến Tre, Cần Thơ.” (On March 12, 2012, cholera broke out in Hanoi, Hai Phong, Quang Ninh, Ben Tre, and Can Tho).

In this example, "Hanoi", "Hai Phong", "Quang Ninh", "Ben Tre", and "Can Tho" are recognized as organizations (tagged with <ORG> and </ORG> pairs), so they would be ignored during processing.
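The effect of this mislabeling can be sketched as follows. The example assumes that locations are tagged with `<LOC>` pairs (only the `<ORG>` tag is named in the text) and adds an optional gazetteer fallback that is not part of the described system; `extract_locations` is a hypothetical helper.

```python
import re
from typing import Collection, List

# Matches <LOC>...</LOC> or <ORG>...</ORG>; the LOC tag name is an assumption.
TAG = re.compile(r"<(LOC|ORG)>(.*?)</\1>")


def extract_locations(tagged: str, gazetteer: Collection[str] = ()) -> List[str]:
    """Collect LOC entities, plus ORG entities that the gazetteer lists as places."""
    locations = []
    for label, text in TAG.findall(tagged):
        if label == "LOC" or text in gazetteer:
            locations.append(text)
    return locations


tagged = "dịch tiêu chảy cấp đã bùng phát tại <ORG>Hà Nội</ORG>, <ORG>Hải Phòng</ORG>"
print(extract_locations(tagged))                           # []
print(extract_locations(tagged, {"Hà Nội", "Hải Phòng"}))  # ['Hà Nội', 'Hải Phòng']
```

Without the fallback, every place tagged as an organization is lost, which matches the behaviour described for Example 10.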

In both Experiment c and Experiment d, some extracted disease names were incorrect because the correct names are not in the disease dictionary. Moreover, the disease dictionary contains some names that are also symptoms of a disease, which confuses the disease extraction module. For instance, in Table 7, the disease name A/H1N flu in the 89th article is detected as pneumonia, although pneumonia is a symptom of the A/H1N flu.
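The disease and symptom confusion could be reduced with a lookup of dictionary names that are also symptoms. The following sketch is hypothetical and is not the method used in the experiments: it assumes that both names are matched in the same article and that a `SYMPTOM_OF` mapping is available; the single entry is taken from the example above.

```python
from typing import Dict, List, Set

# Hypothetical mapping: dictionary name -> diseases of which it is a symptom.
SYMPTOM_OF: Dict[str, Set[str]] = {"pneumonia": {"A/H1N flu"}}


def filter_symptoms(matched: List[str]) -> List[str]:
    """Drop a matched name when a disease it is a symptom of was also matched."""
    present = set(matched)
    return [name for name in matched
            if not (SYMPTOM_OF.get(name, set()) & present)]


print(filter_symptoms(["pneumonia", "A/H1N flu"]))  # ['A/H1N flu']
print(filter_symptoms(["pneumonia"]))               # ['pneumonia']
```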

In addition, several factors have a negative effect on event extraction. Firstly, misspelled location names in articles reduce the performance of location extraction. For instance, "Đắk Lắk" is sometimes written as "Đắc Lắc", but "Đắc Lắc" does not appear in the location dictionary, so the location information can be missed. Secondly, if locations are not described precisely, such as “các huyện phía Tây của tỉnh Bến Tre” (the western districts of "Ben Tre" province), the NER utility cannot recognize them. Finally, another important cause is geo-ambiguity, which reduces the precision of the event extraction component. One proper name can refer to several places; if a disease news article does not identify the place clearly, the location information can be confused. Geo-ambiguity is illustrated in Example 11.

Table 7: The errors in Experiment c (15 of 25 errors)

Correct Information Extracted Information

3 13 Ward 6, District 8, Ward 14, Ho Chi Minh City District 5, District 8, Ward 7, Binh Thanh District, Hoc Mon District

Đinh

Nam Đinh

Thang, D’Dak Rong

NULL

Quan Ngoai, Lang Chanh village, Lang Mau village, and Nhan Ly

Tam Quan commune

Example 11:

“Ngày 05/10/2012, Sở Y tế Quảng Ninh thông báo đã phát hiện vi khuẩn tả tại thị trấn Đông Hải” (On October 5, 2012, the Quang Ninh Department of Health announced the detection of cholera bacteria in Dong Hai town)

In this example, "Dong Hai" town is a location in both "Tra Vinh" and "Quang Ninh" provinces, but the article mentions only Dong Hai town, so the module could not decide whether the disease outbreak was in "Quang Ninh" or "Tra Vinh".
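The misspelling and geo-ambiguity problems discussed above can both be approached with dictionary lookups. The sketch below illustrates two heuristics that are not part of the described system: approximate matching of a misspelled name with Python's standard `difflib`, and picking a province for an ambiguous town only when exactly one candidate province is mentioned in the article. The `GAZETTEER` and `PROVINCES` entries and all function names are hypothetical, and the context heuristic can mislead when an agency of one province reports on another.

```python
import difflib
import unicodedata
from typing import Dict, List, Optional


def nfc(text: str) -> str:
    """NFC-normalize text so accented Vietnamese strings compare reliably."""
    return unicodedata.normalize("NFC", text)


# Hypothetical dictionaries, limited to the names used in the examples above.
PROVINCES: List[str] = ["Quảng Ninh", "Trà Vinh", "Đắk Lắk", "Bến Tre"]
GAZETTEER: Dict[str, List[str]] = {nfc("Đông Hải"): ["Quảng Ninh", "Trà Vinh"]}


def correct_spelling(name: str, known: List[str], cutoff: float = 0.7) -> Optional[str]:
    """Return the closest dictionary name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(
        nfc(name), [nfc(k) for k in known], n=1, cutoff=cutoff)
    return matches[0] if matches else None


def resolve_province(town: str, article_text: str) -> Optional[str]:
    """Pick a province only when exactly one candidate is mentioned in the article."""
    text = nfc(article_text)
    mentioned = [p for p in GAZETTEER.get(nfc(town), []) if nfc(p) in text]
    return mentioned[0] if len(mentioned) == 1 else None


print(correct_spelling("Đắc Lắc", PROVINCES))
sentence = "Sở Y tế Quảng Ninh thông báo đã phát hiện vi khuẩn tả tại thị trấn Đông Hải"
print(resolve_province("Đông Hải", sentence))
```

For the sentence of Example 11, the only candidate province mentioned is the one in the name of the Department of Health, so the heuristic selects it; when no candidate or more than one candidate is mentioned, it returns `None` and leaves the location unresolved.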

Another source of errors was the incomplete recognition of locations, i.e., only some parts of a location were detected, as shown in row 4 of Table 7 (where only Nam Dinh province was detected) and row 11 of Table 8 (where only Binh Duong province was recognized).

The last source of errors was the case in which a location mentioned in the text was not the outbreak place. This misled the location module into extracting incorrect information, as shown in row 9 of Table 7 and row 8 of Table 8.

In this paper, we introduced a method that combines semantic rules and machine learning to extract disease events from Vietnamese webpages. The experimental results show that our method is suitable for extracting disease events in Vietnamese. Furthermore, we briefly described our system's process, with emphasis on two key components: the Event Detector and the Event Extractor. We plan to integrate the event database into the VnLoc system (http://vnloc.com/), where users can follow several event types: FIRE, CRIMINAL, and TRANSPORT ACCIDENT.

However, our method needs some improvements to enhance its performance in the future. Firstly, the coverage of the semantic rules and the performance of the Maximum Entropy classifier must be enhanced by adding useful information. Secondly, the precision of event extraction can be increased by improving the performance of the NER tool. In addition, geo-ambiguity and the confusion between diseases and symptoms should be addressed. Finally, relations between disease events should be considered to enhance the quality of the monitoring system.


Table 8: The errors in Experiment d

Correct Information Extracted Information

Nhon City

Binh Dinh

5 25 4 village, Hoa An commune, Krong Pac district, Dak Lak

Hoa An Commune, Chiem Hoa district, Tuyen Quang

6 26 Ward 8, District 5, Ho Chi Minh City (P.8, Q.5, TP HCM)

NULL

Phu, Chau Thanh Ba Tri, Cho Lach

Ben Tre

Dinh, Hanoi

Hanoi, Vinh Phuc

Town, Binh Duong

Binh Duong

14 84 Thanh Binh Ward, Hai Chau district, Da Nang city,

Dak Lak

Ward Thanh Binh, Ninh Binh City, Ninh Binh City

Da Nang, Hai Chau District

Hoan Kiem District, Thanh Tri, Dong Da, Quang Ninh, Bac Giang, Nam Dinh, Thai Binh, Ha Nam, Hung Yen

Hanoi

